2013-09-16 11:26:29

by Kirill A. Shutemov

Subject: [PATCHv2 0/9] split page table lock for PMD tables

Alex Thorlton noticed that some massively threaded workloads perform poorly
if THP is enabled. This patchset fixes that by introducing a split page table
lock for PMD tables. hugetlbfs is not covered yet.
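
For illustration only (not from the patches themselves), the change boils down
to the following pattern; pmd_lock() is the helper introduced in patch 3:

	/* before: every huge-pmd operation in the mm serializes on one lock */
	spin_lock(&mm->page_table_lock);
	/* ... install or zap a huge pmd ... */
	spin_unlock(&mm->page_table_lock);

	/* after: operations on different PMD tables take different locks */
	spinlock_t *ptl = pmd_lock(mm, pmd);
	/* ... install or zap a huge pmd ... */
	spin_unlock(ptl);

Threads faulting in ranges covered by different PMD tables therefore no longer
contend on a single per-mm spinlock.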

This patchset is based on work by Naoya Horiguchi.

Changes:
v2:
- reuse CONFIG_SPLIT_PTLOCK_CPUS for PMD split lock;
- s/huge_pmd_lock/pmd_lock/g;
- assume pgtable_pmd_page_ctor() can fail;
- fix format line in task_mem() for VmPTE;

Benchmark (from Alex): ftp://shell.sgi.com/collect/appsx_test/pthread_test.tar.gz
Run on 4 socket Westmere with 128 GiB of RAM.

THP off:
--------

Performance counter stats for './thp_pthread -C 0 -m 0 -c 80 -b 100g' (5 runs):

1738259.808012 task-clock # 47.571 CPUs utilized ( +- 9.49% )
147,359 context-switches # 0.085 K/sec ( +- 9.67% )
14 cpu-migrations # 0.000 K/sec ( +- 13.25% )
24,410,139 page-faults # 0.014 M/sec ( +- 0.00% )
4,149,037,526,252 cycles # 2.387 GHz ( +- 9.50% )
3,649,839,735,027 stalled-cycles-frontend # 87.97% frontend cycles idle ( +- 6.60% )
2,455,558,969,567 stalled-cycles-backend # 59.18% backend cycles idle ( +- 22.92% )
1,434,961,518,604 instructions # 0.35 insns per cycle
# 2.54 stalled cycles per insn ( +- 92.86% )
241,472,020,951 branches # 138.916 M/sec ( +- 91.72% )
84,022,172 branch-misses # 0.03% of all branches ( +- 3.16% )

36.540185552 seconds time elapsed ( +- 18.36% )

THP on, no patchset:
--------------------
Performance counter stats for './thp_pthread -C 0 -m 0 -c 80 -b 100g' (5 runs):

2528378.966949 task-clock # 50.715 CPUs utilized ( +- 11.86% )
214,063 context-switches # 0.085 K/sec ( +- 11.94% )
19 cpu-migrations # 0.000 K/sec ( +- 22.72% )
49,226 page-faults # 0.019 K/sec ( +- 0.33% )
6,034,640,598,498 cycles # 2.387 GHz ( +- 11.91% )
5,685,933,794,081 stalled-cycles-frontend # 94.22% frontend cycles idle ( +- 7.67% )
4,414,381,393,353 stalled-cycles-backend # 73.15% backend cycles idle ( +- 2.09% )
952,086,804,776 instructions # 0.16 insns per cycle
# 5.97 stalled cycles per insn ( +- 89.59% )
166,191,211,974 branches # 65.730 M/sec ( +- 85.52% )
33,341,022 branch-misses # 0.02% of all branches ( +- 3.90% )

49.854741504 seconds time elapsed ( +- 14.76% )

THP on, with patchset:
----------------------

echo always > /sys/kernel/mm/transparent_hugepage/enabled
Performance counter stats for './thp_pthread -C 0 -m 0 -c 80 -b 100g' (5 runs):

1538763.343568 task-clock # 45.386 CPUs utilized ( +- 7.21% )
130,469 context-switches # 0.085 K/sec ( +- 7.32% )
14 cpu-migrations # 0.000 K/sec ( +- 23.58% )
49,299 page-faults # 0.032 K/sec ( +- 0.15% )
3,666,748,502,650 cycles # 2.383 GHz ( +- 7.25% )
3,330,488,035,212 stalled-cycles-frontend # 90.83% frontend cycles idle ( +- 4.70% )
2,383,357,073,990 stalled-cycles-backend # 65.00% backend cycles idle ( +- 16.06% )
935,504,610,528 instructions # 0.26 insns per cycle
# 3.56 stalled cycles per insn ( +- 91.16% )
161,466,689,532 branches # 104.933 M/sec ( +- 87.67% )
22,602,225 branch-misses # 0.01% of all branches ( +- 6.43% )

33.903917543 seconds time elapsed ( +- 12.57% )

Kirill A. Shutemov (9):
mm: rename USE_SPLIT_PTLOCKS to USE_SPLIT_PTE_PTLOCKS
mm: convert mm->nr_ptes to atomic_t
mm: introduce api for split page table lock for PMD level
mm, thp: change pmd_trans_huge_lock() to return taken lock
mm, thp: move ptl taking inside page_check_address_pmd()
mm, thp: do not access mm->pmd_huge_pte directly
mm: convert the rest to new page table lock api
mm: implement split page table lock for PMD level
x86, mm: enable split page table lock for PMD level

arch/arm/mm/fault-armv.c | 6 +-
arch/s390/mm/pgtable.c | 12 +--
arch/sparc/mm/tlb.c | 12 +--
arch/x86/Kconfig | 4 +
arch/x86/include/asm/pgalloc.h | 11 ++-
arch/x86/xen/mmu.c | 6 +-
fs/proc/task_mmu.c | 17 ++--
include/linux/huge_mm.h | 17 ++--
include/linux/mm.h | 52 ++++++++++-
include/linux/mm_types.h | 18 ++--
kernel/fork.c | 6 +-
mm/Kconfig | 3 +
mm/huge_memory.c | 201 ++++++++++++++++++++++++-----------------
mm/memcontrol.c | 10 +-
mm/memory.c | 21 +++--
mm/migrate.c | 7 +-
mm/mmap.c | 3 +-
mm/mprotect.c | 4 +-
mm/oom_kill.c | 6 +-
mm/pgtable-generic.c | 16 ++--
mm/rmap.c | 13 +--
21 files changed, 276 insertions(+), 169 deletions(-)

--
1.8.4.rc3


2013-09-16 11:26:00

by Kirill A. Shutemov

Subject: [PATCHv2 3/9] mm: introduce api for split page table lock for PMD level

Basic API, backed by mm->page_table_lock for now. The actual implementation
will be added later.
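
For reference, a sketch (not part of this patch) of the caller pattern the API
is meant to support:

	spinlock_t *ptl;

	ptl = pmd_lock(mm, pmd);	/* for now this is just mm->page_table_lock */
	/* ... modify the pmd entry ... */
	spin_unlock(ptl);

	/* lock assertions use the pointer form, e.g.: */
	assert_spin_locked(pmd_lockptr(mm, pmd));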

Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6cf8ddb..e3481c6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1294,6 +1294,19 @@ static inline void pgtable_page_dtor(struct page *page)
((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd, address))? \
NULL: pte_offset_kernel(pmd, address))

+static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+ return &mm->page_table_lock;
+}
+
+
+static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
+{
+ spinlock_t *ptl = pmd_lockptr(mm, pmd);
+ spin_lock(ptl);
+ return ptl;
+}
+
extern void free_area_init(unsigned long * zones_size);
extern void free_area_init_node(int nid, unsigned long * zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
--
1.8.4.rc3

2013-09-16 11:25:57

by Kirill A. Shutemov

Subject: [PATCHv2 4/9] mm, thp: change pmd_trans_huge_lock() to return taken lock

With split ptlock it's important to know which lock pmd_trans_huge_lock()
took. This patch adds one more parameter to the function to return the
taken lock.

In most places migration to the new API is trivial. The exception is
move_huge_pmd(): we need to take two locks if the PMD tables are different.
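
A condensed sketch of the move_huge_pmd() case (the full hunk is in the diff
below):

	ret = __pmd_trans_huge_lock(old_pmd, vma, &old_ptl);
	if (ret == 1) {
		new_ptl = pmd_lockptr(mm, new_pmd);
		if (new_ptl != old_ptl)
			spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
		/* ... move the huge pmd from old_pmd to new_pmd ... */
		if (new_ptl != old_ptl)
			spin_unlock(new_ptl);
		spin_unlock(old_ptl);
	}

Lock ordering is not a concern here because the exclusive mmap_sem prevents
a deadlock.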

Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/proc/task_mmu.c | 13 +++++++------
include/linux/huge_mm.h | 14 +++++++-------
mm/huge_memory.c | 40 +++++++++++++++++++++++++++-------------
mm/memcontrol.c | 10 +++++-----
4 files changed, 46 insertions(+), 31 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8e124ac..4599519 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -505,9 +505,9 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
pte_t *pte;
spinlock_t *ptl;

- if (pmd_trans_huge_lock(pmd, vma) == 1) {
+ if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_PMD_SIZE, walk);
- spin_unlock(&walk->mm->page_table_lock);
+ spin_unlock(ptl);
mss->anonymous_thp += HPAGE_PMD_SIZE;
return 0;
}
@@ -993,13 +993,14 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
{
struct vm_area_struct *vma;
struct pagemapread *pm = walk->private;
+ spinlock_t *ptl;
pte_t *pte;
int err = 0;
pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2));

/* find the first VMA at or above 'addr' */
vma = find_vma(walk->mm, addr);
- if (vma && pmd_trans_huge_lock(pmd, vma) == 1) {
+ if (vma && pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
int pmd_flags2;

if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd))
@@ -1017,7 +1018,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
if (err)
break;
}
- spin_unlock(&walk->mm->page_table_lock);
+ spin_unlock(ptl);
return err;
}

@@ -1319,7 +1320,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,

md = walk->private;

- if (pmd_trans_huge_lock(pmd, md->vma) == 1) {
+ if (pmd_trans_huge_lock(pmd, md->vma, &ptl) == 1) {
pte_t huge_pte = *(pte_t *)pmd;
struct page *page;

@@ -1327,7 +1328,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
if (page)
gather_stats(page, md, pte_dirty(huge_pte),
HPAGE_PMD_SIZE/PAGE_SIZE);
- spin_unlock(&walk->mm->page_table_lock);
+ spin_unlock(ptl);
return 0;
}

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3935428..4aca0d8 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -129,15 +129,15 @@ extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
unsigned long start,
unsigned long end,
long adjust_next);
-extern int __pmd_trans_huge_lock(pmd_t *pmd,
- struct vm_area_struct *vma);
+extern int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
+ spinlock_t **ptl);
/* mmap_sem must be held on entry */
-static inline int pmd_trans_huge_lock(pmd_t *pmd,
- struct vm_area_struct *vma)
+static inline int pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
+ spinlock_t **ptl)
{
VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
if (pmd_trans_huge(*pmd))
- return __pmd_trans_huge_lock(pmd, vma);
+ return __pmd_trans_huge_lock(pmd, vma, ptl);
else
return 0;
}
@@ -215,8 +215,8 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
long adjust_next)
{
}
-static inline int pmd_trans_huge_lock(pmd_t *pmd,
- struct vm_area_struct *vma)
+static inline int pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
+ spinlock_t **ptl)
{
return 0;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bbd41a2..59a1340 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1342,9 +1342,10 @@ out_unlock:
int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr)
{
+ spinlock_t *ptl;
int ret = 0;

- if (__pmd_trans_huge_lock(pmd, vma) == 1) {
+ if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
struct page *page;
pgtable_t pgtable;
pmd_t orig_pmd;
@@ -1359,7 +1360,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
if (is_huge_zero_pmd(orig_pmd)) {
atomic_dec(&tlb->mm->nr_ptes);
- spin_unlock(&tlb->mm->page_table_lock);
+ spin_unlock(ptl);
put_huge_zero_page();
} else {
page = pmd_page(orig_pmd);
@@ -1368,7 +1369,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
VM_BUG_ON(!PageHead(page));
atomic_dec(&tlb->mm->nr_ptes);
- spin_unlock(&tlb->mm->page_table_lock);
+ spin_unlock(ptl);
tlb_remove_page(tlb, page);
}
pte_free(tlb->mm, pgtable);
@@ -1381,14 +1382,15 @@ int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
unsigned char *vec)
{
+ spinlock_t *ptl;
int ret = 0;

- if (__pmd_trans_huge_lock(pmd, vma) == 1) {
+ if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
/*
* All logical pages in the range are present
* if backed by a huge page.
*/
- spin_unlock(&vma->vm_mm->page_table_lock);
+ spin_unlock(ptl);
memset(vec, 1, (end - addr) >> PAGE_SHIFT);
ret = 1;
}
@@ -1401,6 +1403,7 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
unsigned long new_addr, unsigned long old_end,
pmd_t *old_pmd, pmd_t *new_pmd)
{
+ spinlock_t *old_ptl, *new_ptl;
int ret = 0;
pmd_t pmd;

@@ -1421,12 +1424,21 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
goto out;
}

- ret = __pmd_trans_huge_lock(old_pmd, vma);
+ /*
+ * We don't have to worry about the ordering of src and dst
+ * ptlocks because exclusive mmap_sem prevents deadlock.
+ */
+ ret = __pmd_trans_huge_lock(old_pmd, vma, &old_ptl);
if (ret == 1) {
+ new_ptl = pmd_lockptr(mm, new_pmd);
+ if (new_ptl != old_ptl)
+ spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
VM_BUG_ON(!pmd_none(*new_pmd));
set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
- spin_unlock(&mm->page_table_lock);
+ if (new_ptl != old_ptl)
+ spin_unlock(new_ptl);
+ spin_unlock(old_ptl);
}
out:
return ret;
@@ -1436,9 +1448,10 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, pgprot_t newprot, int prot_numa)
{
struct mm_struct *mm = vma->vm_mm;
+ spinlock_t *ptl;
int ret = 0;

- if (__pmd_trans_huge_lock(pmd, vma) == 1) {
+ if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
pmd_t entry;
entry = pmdp_get_and_clear(mm, addr, pmd);
if (!prot_numa) {
@@ -1454,7 +1467,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
}
}
set_pmd_at(mm, addr, pmd, entry);
- spin_unlock(&vma->vm_mm->page_table_lock);
+ spin_unlock(ptl);
ret = 1;
}

@@ -1468,12 +1481,13 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
* Note that if it returns 1, this routine returns without unlocking page
* table locks. So callers must unlock them.
*/
-int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
+int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
+ spinlock_t **ptl)
{
- spin_lock(&vma->vm_mm->page_table_lock);
+ *ptl = pmd_lock(vma->vm_mm, pmd);
if (likely(pmd_trans_huge(*pmd))) {
if (unlikely(pmd_trans_splitting(*pmd))) {
- spin_unlock(&vma->vm_mm->page_table_lock);
+ spin_unlock(*ptl);
wait_split_huge_page(vma->anon_vma, pmd);
return -1;
} else {
@@ -1482,7 +1496,7 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
return 1;
}
}
- spin_unlock(&vma->vm_mm->page_table_lock);
+ spin_unlock(*ptl);
return 0;
}

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d5ff3ce..5f35b2a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6376,10 +6376,10 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
pte_t *pte;
spinlock_t *ptl;

- if (pmd_trans_huge_lock(pmd, vma) == 1) {
+ if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
mc.precharge += HPAGE_PMD_NR;
- spin_unlock(&vma->vm_mm->page_table_lock);
+ spin_unlock(ptl);
return 0;
}

@@ -6568,9 +6568,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
* to be unlocked in __split_huge_page_splitting(), where the main
* part of thp split is not executed yet.
*/
- if (pmd_trans_huge_lock(pmd, vma) == 1) {
+ if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
if (mc.precharge < HPAGE_PMD_NR) {
- spin_unlock(&vma->vm_mm->page_table_lock);
+ spin_unlock(ptl);
return 0;
}
target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
@@ -6587,7 +6587,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
}
put_page(page);
}
- spin_unlock(&vma->vm_mm->page_table_lock);
+ spin_unlock(ptl);
return 0;
}

--
1.8.4.rc3

2013-09-16 11:26:10

by Kirill A. Shutemov

Subject: [PATCHv2 5/9] mm, thp: move ptl taking inside page_check_address_pmd()

With a split page table lock we can't know which lock we need to take
before we find the relevant pmd.

Let's move the lock taking inside the function.
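
The resulting caller pattern, as a sketch:

	spinlock_t *ptl;
	pmd_t *pmd;

	pmd = page_check_address_pmd(page, mm, address, flag, &ptl);
	if (pmd) {
		/* ... work on the huge pmd; ptl is held here ... */
		spin_unlock(ptl);
	}
	/* on failure the function returns NULL and no lock is held */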

Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 3 ++-
mm/huge_memory.c | 43 +++++++++++++++++++++++++++----------------
mm/rmap.c | 13 +++++--------
3 files changed, 34 insertions(+), 25 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4aca0d8..91672e2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -54,7 +54,8 @@ enum page_check_address_pmd_flag {
extern pmd_t *page_check_address_pmd(struct page *page,
struct mm_struct *mm,
unsigned long address,
- enum page_check_address_pmd_flag flag);
+ enum page_check_address_pmd_flag flag,
+ spinlock_t **ptl);

#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 59a1340..3a1f5c1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1500,23 +1500,33 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
return 0;
}

+/*
+ * This function returns whether a given @page is mapped onto the @address
+ * in the virtual space of @mm.
+ *
+ * When it's true, this function returns *pmd with holding the page table lock
+ * and passing it back to the caller via @ptl.
+ * If it's false, returns NULL without holding the page table lock.
+ */
pmd_t *page_check_address_pmd(struct page *page,
struct mm_struct *mm,
unsigned long address,
- enum page_check_address_pmd_flag flag)
+ enum page_check_address_pmd_flag flag,
+ spinlock_t **ptl)
{
- pmd_t *pmd, *ret = NULL;
+ pmd_t *pmd;

if (address & ~HPAGE_PMD_MASK)
- goto out;
+ return NULL;

pmd = mm_find_pmd(mm, address);
if (!pmd)
- goto out;
+ return NULL;
+ *ptl = pmd_lock(mm, pmd);
if (pmd_none(*pmd))
- goto out;
+ goto unlock;
if (pmd_page(*pmd) != page)
- goto out;
+ goto unlock;
/*
* split_vma() may create temporary aliased mappings. There is
* no risk as long as all huge pmd are found and have their
@@ -1526,14 +1536,15 @@ pmd_t *page_check_address_pmd(struct page *page,
*/
if (flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
pmd_trans_splitting(*pmd))
- goto out;
+ goto unlock;
if (pmd_trans_huge(*pmd)) {
VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
!pmd_trans_splitting(*pmd));
- ret = pmd;
+ return pmd;
}
-out:
- return ret;
+unlock:
+ spin_unlock(*ptl);
+ return NULL;
}

static int __split_huge_page_splitting(struct page *page,
@@ -1541,6 +1552,7 @@ static int __split_huge_page_splitting(struct page *page,
unsigned long address)
{
struct mm_struct *mm = vma->vm_mm;
+ spinlock_t *ptl;
pmd_t *pmd;
int ret = 0;
/* For mmu_notifiers */
@@ -1548,9 +1560,8 @@ static int __split_huge_page_splitting(struct page *page,
const unsigned long mmun_end = address + HPAGE_PMD_SIZE;

mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
- spin_lock(&mm->page_table_lock);
pmd = page_check_address_pmd(page, mm, address,
- PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
+ PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
if (pmd) {
/*
* We can't temporarily set the pmd to null in order
@@ -1561,8 +1572,8 @@ static int __split_huge_page_splitting(struct page *page,
*/
pmdp_splitting_flush(vma, address, pmd);
ret = 1;
+ spin_unlock(ptl);
}
- spin_unlock(&mm->page_table_lock);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

return ret;
@@ -1693,14 +1704,14 @@ static int __split_huge_page_map(struct page *page,
unsigned long address)
{
struct mm_struct *mm = vma->vm_mm;
+ spinlock_t *ptl;
pmd_t *pmd, _pmd;
int ret = 0, i;
pgtable_t pgtable;
unsigned long haddr;

- spin_lock(&mm->page_table_lock);
pmd = page_check_address_pmd(page, mm, address,
- PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
+ PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG, &ptl);
if (pmd) {
pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pmd_populate(mm, &_pmd, pgtable);
@@ -1755,8 +1766,8 @@ static int __split_huge_page_map(struct page *page,
pmdp_invalidate(vma, address, pmd);
pmd_populate(mm, pmd, pgtable);
ret = 1;
+ spin_unlock(ptl);
}
- spin_unlock(&mm->page_table_lock);

return ret;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index fd3ee7a..b59d741 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -665,25 +665,23 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
unsigned long *vm_flags)
{
struct mm_struct *mm = vma->vm_mm;
+ spinlock_t *ptl;
int referenced = 0;

if (unlikely(PageTransHuge(page))) {
pmd_t *pmd;

- spin_lock(&mm->page_table_lock);
/*
* rmap might return false positives; we must filter
* these out using page_check_address_pmd().
*/
pmd = page_check_address_pmd(page, mm, address,
- PAGE_CHECK_ADDRESS_PMD_FLAG);
- if (!pmd) {
- spin_unlock(&mm->page_table_lock);
+ PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
+ if (!pmd)
goto out;
- }

if (vma->vm_flags & VM_LOCKED) {
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
*mapcount = 0; /* break early from loop */
*vm_flags |= VM_LOCKED;
goto out;
@@ -692,10 +690,9 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
/* go ahead even if the pmd is pmd_trans_splitting() */
if (pmdp_clear_flush_young_notify(vma, address, pmd))
referenced++;
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
} else {
pte_t *pte;
- spinlock_t *ptl;

/*
* rmap might return false positives; we must filter
--
1.8.4.rc3

2013-09-16 11:26:22

by Kirill A. Shutemov

Subject: [PATCHv2 6/9] mm, thp: do not access mm->pmd_huge_pte directly

Currently mm->pmd_huge_pte is protected by the page table lock. That will
not work with a split lock: we need a per-PMD pmd_huge_pte for proper
access serialization.

For now, let's just introduce a wrapper to access mm->pmd_huge_pte.
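
Both definitions below are taken from the series; the wrapper is what lets
patch 8 move the storage without touching callers:

	/* this patch: still backed by the per-mm field */
	#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)

	/* patch 8, with split PMD ptlocks: backed by the table's struct page */
	#define pmd_huge_pte(mm, pmd) (virt_to_page(pmd)->pmd_huge_pte)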

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/s390/mm/pgtable.c | 12 ++++++------
arch/sparc/mm/tlb.c | 12 ++++++------
include/linux/mm.h | 1 +
mm/pgtable-generic.c | 12 ++++++------
4 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index de8cbc3..c463e5c 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -1225,11 +1225,11 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
assert_spin_locked(&mm->page_table_lock);

/* FIFO */
- if (!mm->pmd_huge_pte)
+ if (!pmd_huge_pte(mm, pmdp))
INIT_LIST_HEAD(lh);
else
- list_add(lh, (struct list_head *) mm->pmd_huge_pte);
- mm->pmd_huge_pte = pgtable;
+ list_add(lh, (struct list_head *) pmd_huge_pte(mm, pmdp));
+ pmd_huge_pte(mm, pmdp) = pgtable;
}

pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
@@ -1241,12 +1241,12 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
assert_spin_locked(&mm->page_table_lock);

/* FIFO */
- pgtable = mm->pmd_huge_pte;
+ pgtable = pmd_huge_pte(mm, pmdp);
lh = (struct list_head *) pgtable;
if (list_empty(lh))
- mm->pmd_huge_pte = NULL;
+ pmd_huge_pte(mm, pmdp) = NULL;
else {
- mm->pmd_huge_pte = (pgtable_t) lh->next;
+ pmd_huge_pte(mm, pmdp) = (pgtable_t) lh->next;
list_del(lh);
}
ptep = (pte_t *) pgtable;
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 7a91f28..656cc46 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -196,11 +196,11 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
assert_spin_locked(&mm->page_table_lock);

/* FIFO */
- if (!mm->pmd_huge_pte)
+ if (!pmd_huge_pte(mm, pmdp))
INIT_LIST_HEAD(lh);
else
- list_add(lh, (struct list_head *) mm->pmd_huge_pte);
- mm->pmd_huge_pte = pgtable;
+ list_add(lh, (struct list_head *) pmd_huge_pte(mm, pmdp));
+ pmd_huge_pte(mm, pmdp) = pgtable;
}

pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
@@ -211,12 +211,12 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
assert_spin_locked(&mm->page_table_lock);

/* FIFO */
- pgtable = mm->pmd_huge_pte;
+ pgtable = pmd_huge_pte(mm, pmdp);
lh = (struct list_head *) pgtable;
if (list_empty(lh))
- mm->pmd_huge_pte = NULL;
+ pmd_huge_pte(mm, pmdp) = NULL;
else {
- mm->pmd_huge_pte = (pgtable_t) lh->next;
+ pmd_huge_pte(mm, pmdp) = (pgtable_t) lh->next;
list_del(lh);
}
pte_val(pgtable[0]) = 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e3481c6..b96aac9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1299,6 +1299,7 @@ static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
return &mm->page_table_lock;
}

+#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)

static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
{
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 3929a40..41fee3e 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -154,11 +154,11 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
assert_spin_locked(&mm->page_table_lock);

/* FIFO */
- if (!mm->pmd_huge_pte)
+ if (!pmd_huge_pte(mm, pmdp))
INIT_LIST_HEAD(&pgtable->lru);
else
- list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
- mm->pmd_huge_pte = pgtable;
+ list_add(&pgtable->lru, &pmd_huge_pte(mm, pmdp)->lru);
+ pmd_huge_pte(mm, pmdp) = pgtable;
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif
@@ -173,11 +173,11 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
assert_spin_locked(&mm->page_table_lock);

/* FIFO */
- pgtable = mm->pmd_huge_pte;
+ pgtable = pmd_huge_pte(mm, pmdp);
if (list_empty(&pgtable->lru))
- mm->pmd_huge_pte = NULL;
+ pmd_huge_pte(mm, pmdp) = NULL;
else {
- mm->pmd_huge_pte = list_entry(pgtable->lru.next,
+ pmd_huge_pte(mm, pmdp) = list_entry(pgtable->lru.next,
struct page, lru);
list_del(&pgtable->lru);
}
--
1.8.4.rc3

2013-09-16 11:26:17

by Kirill A. Shutemov

Subject: [PATCHv2 8/9] mm: implement split page table lock for PMD level

The basic idea is the same as with the PTE level: the lock is embedded
into the struct page of the table's page.

We can't use mm->pmd_huge_pte to store pgtables for THP, since we don't
take mm->page_table_lock anymore. Let's reuse page->lru of the table's
page for that.

hugetlbfs hasn't been converted to split locking yet: disable split
locking if hugetlbfs is enabled.

pgtable_pmd_page_ctor() returns true if initialization is successful and
false otherwise. The current implementation never fails, but assuming that
the constructor can fail will help to port it to -rt, where spinlock_t is
rather huge and cannot be embedded into struct page -- dynamic allocation
would be required.
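
Since the constructor may fail, architectures that enable this are expected
to allocate PMD tables along these lines (patch 9 does exactly this for x86):

	page = alloc_pages(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO, 0);
	if (!page)
		return NULL;
	if (!pgtable_pmd_page_ctor(page)) {
		__free_pages(page, 0);
		return NULL;
	}
	return (pmd_t *)page_address(page);

The matching pmd_free() must call pgtable_pmd_page_dtor() before freeing
the page.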

Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 32 ++++++++++++++++++++++++++++++++
include/linux/mm_types.h | 8 +++++++-
kernel/fork.c | 4 ++--
mm/Kconfig | 3 +++
4 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b96aac9..75735f6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1294,13 +1294,45 @@ static inline void pgtable_page_dtor(struct page *page)
((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd, address))? \
NULL: pte_offset_kernel(pmd, address))

+#if USE_SPLIT_PMD_PTLOCKS
+
+static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+ return &virt_to_page(pmd)->ptl;
+}
+
+static inline bool pgtable_pmd_page_ctor(struct page *page)
+{
+ spin_lock_init(&page->ptl);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ page->pmd_huge_pte = NULL;
+#endif
+ return true;
+}
+
+static inline void pgtable_pmd_page_dtor(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ VM_BUG_ON(page->pmd_huge_pte);
+#endif
+}
+
+#define pmd_huge_pte(mm, pmd) (virt_to_page(pmd)->pmd_huge_pte)
+
+#else
+
static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
return &mm->page_table_lock;
}

+static inline bool pgtable_pmd_page_ctor(struct page *page) { return true; }
+static inline void pgtable_pmd_page_dtor(struct page *page) {}
+
#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)

+#endif
+
static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
{
spinlock_t *ptl = pmd_lockptr(mm, pmd);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b17a909..94206cb 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -24,6 +24,9 @@
struct address_space;

#define USE_SPLIT_PTE_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
+/* hugetlb hasn't converted to split locking yet */
+#define USE_SPLIT_PMD_PTLOCKS (USE_SPLIT_PTE_PTLOCKS && \
+ CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK && !CONFIG_HUGETLB_PAGE)

/*
* Each physical page in the system has a struct page associated with
@@ -130,6 +133,9 @@ struct page {

struct list_head list; /* slobs list of pages */
struct slab *slab_page; /* slab fields */
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && USE_SPLIT_PMD_PTLOCKS
+ pgtable_t pmd_huge_pte; /* protected by page->ptl */
+#endif
};

/* Remainder is not double word aligned */
@@ -405,7 +411,7 @@ struct mm_struct {
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_mm *mmu_notifier_mm;
#endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
#ifdef CONFIG_CPUMASK_OFFSTACK
diff --git a/kernel/fork.c b/kernel/fork.c
index 4c8b986..1670af7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -560,7 +560,7 @@ static void check_mm(struct mm_struct *mm)
"mm:%p idx:%d val:%ld\n", mm, i, x);
}

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
VM_BUG_ON(mm->pmd_huge_pte);
#endif
}
@@ -814,7 +814,7 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
memcpy(mm, oldmm, sizeof(*mm));
mm_init_cpumask(mm);

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
mm->pmd_huge_pte = NULL;
#endif
#ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/Kconfig b/mm/Kconfig
index 026771a..89d56e3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -214,6 +214,9 @@ config SPLIT_PTLOCK_CPUS
default "999999" if DEBUG_SPINLOCK || DEBUG_LOCK_ALLOC
default "4"

+config ARCH_ENABLE_SPLIT_PMD_PTLOCK
+ boolean
+
#
# support for memory balloon compaction
config BALLOON_COMPACTION
--
1.8.4.rc3

2013-09-16 11:26:14

by Kirill A. Shutemov

Subject: [PATCHv2 9/9] x86, mm: enable split page table lock for PMD level

Enable PMD split page table lock for X86_64 and PAE.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/Kconfig | 4 ++++
arch/x86/include/asm/pgalloc.h | 11 ++++++++++-
2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 30c40f0..6a5cf6a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1880,6 +1880,10 @@ config USE_PERCPU_NUMA_NODE_ID
def_bool y
depends on NUMA

+config ARCH_ENABLE_SPLIT_PMD_PTLOCK
+ def_bool y
+ depends on X86_64 || X86_PAE
+
menu "Power management and ACPI options"

config ARCH_HIBERNATION_HEADER
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index b4389a4..e2fb2b6 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -80,12 +80,21 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
#if PAGETABLE_LEVELS > 2
static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
{
- return (pmd_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+ struct page *page;
+ page = alloc_pages(GFP_KERNEL | __GFP_REPEAT| __GFP_ZERO, 0);
+ if (!page)
+ return NULL;
+ if (!pgtable_pmd_page_ctor(page)) {
+ __free_pages(page, 0);
+ return NULL;
+ }
+ return (pmd_t *)page_address(page);
}

static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
{
BUG_ON((unsigned long)pmd & (PAGE_SIZE-1));
+ pgtable_pmd_page_dtor(virt_to_page(pmd));
free_page((unsigned long)pmd);
}

--
1.8.4.rc3

2013-09-16 11:27:17

by Kirill A. Shutemov

Subject: [PATCHv2 2/9] mm: convert mm->nr_ptes to atomic_t

With a split page table lock for the PMD level, mm->page_table_lock is no
longer held while updating nr_ptes.

Let's convert it to atomic_t to avoid races.
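
A sketch of why a plain counter breaks once the lock is split (pmd_a and
pmd_b are hypothetical pmds living in different PMD tables; pmd_lock()
is introduced later in the series):

	/* CPU 0, faulting in the table of pmd_a */
	ptl = pmd_lock(mm, pmd_a);
	atomic_inc(&mm->nr_ptes);	/* a plain mm->nr_ptes++ here ... */
	spin_unlock(ptl);

	/* CPU 1, faulting in the table of pmd_b, concurrently */
	ptl = pmd_lock(mm, pmd_b);
	atomic_inc(&mm->nr_ptes);	/* ... could race with this one */
	spin_unlock(ptl);

atomic_inc()/atomic_dec()/atomic_read() keep the counter consistent no
matter which (if any) page table lock is held.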

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/proc/task_mmu.c | 4 ++--
include/linux/mm_types.h | 2 +-
kernel/fork.c | 2 +-
mm/huge_memory.c | 10 +++++-----
mm/memory.c | 4 ++--
mm/mmap.c | 3 ++-
mm/oom_kill.c | 6 +++---
7 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 7366e9d..8e124ac 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -52,7 +52,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
"VmStk:\t%8lu kB\n"
"VmExe:\t%8lu kB\n"
"VmLib:\t%8lu kB\n"
- "VmPTE:\t%8lu kB\n"
+ "VmPTE:\t%8zd kB\n"
"VmSwap:\t%8lu kB\n",
hiwater_vm << (PAGE_SHIFT-10),
total_vm << (PAGE_SHIFT-10),
@@ -62,7 +62,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
total_rss << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
- (PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10,
+ (PTRS_PER_PTE*sizeof(pte_t)*atomic_read(&mm->nr_ptes)) >> 10,
swap << (PAGE_SHIFT-10));
}

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c6af00c..b17a909 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -338,6 +338,7 @@ struct mm_struct {
pgd_t * pgd;
atomic_t mm_users; /* How many users with user space? */
atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
+ atomic_t nr_ptes; /* Page table pages */
int map_count; /* number of VMAs */

spinlock_t page_table_lock; /* Protects page tables and some counters */
@@ -359,7 +360,6 @@ struct mm_struct {
unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE */
unsigned long stack_vm; /* VM_GROWSUP/DOWN */
unsigned long def_flags;
- unsigned long nr_ptes; /* Page table pages */
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
diff --git a/kernel/fork.c b/kernel/fork.c
index 81ccb4f..4c8b986 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -532,7 +532,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
mm->flags = (current->mm) ?
(current->mm->flags & MMF_INIT_MASK) : default_dump_filter;
mm->core_state = NULL;
- mm->nr_ptes = 0;
+ atomic_set(&mm->nr_ptes, 0);
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
spin_lock_init(&mm->page_table_lock);
mm_init_aio(mm);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7489884..bbd41a2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -737,7 +737,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, haddr, pmd, entry);
add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
- mm->nr_ptes++;
+ atomic_inc(&mm->nr_ptes);
spin_unlock(&mm->page_table_lock);
}

@@ -778,7 +778,7 @@ static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
entry = pmd_mkhuge(entry);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, haddr, pmd, entry);
- mm->nr_ptes++;
+ atomic_inc(&mm->nr_ptes);
return true;
}

@@ -903,7 +903,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd = pmd_mkold(pmd_wrprotect(pmd));
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
- dst_mm->nr_ptes++;
+ atomic_inc(&dst_mm->nr_ptes);

ret = 0;
out_unlock:
@@ -1358,7 +1358,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
if (is_huge_zero_pmd(orig_pmd)) {
- tlb->mm->nr_ptes--;
+ atomic_dec(&tlb->mm->nr_ptes);
spin_unlock(&tlb->mm->page_table_lock);
put_huge_zero_page();
} else {
@@ -1367,7 +1367,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
VM_BUG_ON(page_mapcount(page) < 0);
add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
VM_BUG_ON(!PageHead(page));
- tlb->mm->nr_ptes--;
+ atomic_dec(&tlb->mm->nr_ptes);
spin_unlock(&tlb->mm->page_table_lock);
tlb_remove_page(tlb, page);
}
diff --git a/mm/memory.c b/mm/memory.c
index ca00039..1046396 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -382,7 +382,7 @@ static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
pgtable_t token = pmd_pgtable(*pmd);
pmd_clear(pmd);
pte_free_tlb(tlb, token, addr);
- tlb->mm->nr_ptes--;
+ atomic_dec(&tlb->mm->nr_ptes);
}

static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
@@ -575,7 +575,7 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
spin_lock(&mm->page_table_lock);
wait_split_huge_page = 0;
if (likely(pmd_none(*pmd))) { /* Has another populated it ? */
- mm->nr_ptes++;
+ atomic_inc(&mm->nr_ptes);
pmd_populate(mm, pmd, new);
new = NULL;
} else if (unlikely(pmd_trans_splitting(*pmd)))
diff --git a/mm/mmap.c b/mm/mmap.c
index 9d54851..1d0efbc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2726,7 +2726,8 @@ void exit_mmap(struct mm_struct *mm)
}
vm_unacct_memory(nr_accounted);

- WARN_ON(mm->nr_ptes > (FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT);
+ WARN_ON(atomic_read(&mm->nr_ptes) >
+ (FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT);
}

/* Insert vm structure into process list sorted by address
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 314e9d2..7ab394e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -161,7 +161,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
* The baseline for the badness score is the proportion of RAM that each
* task's rss, pagetable and swap space use.
*/
- points = get_mm_rss(p->mm) + p->mm->nr_ptes +
+ points = get_mm_rss(p->mm) + atomic_read(&p->mm->nr_ptes) +
get_mm_counter(p->mm, MM_SWAPENTS);
task_unlock(p);

@@ -364,10 +364,10 @@ static void dump_tasks(const struct mem_cgroup *memcg, const nodemask_t *nodemas
continue;
}

- pr_info("[%5d] %5d %5d %8lu %8lu %7lu %8lu %5hd %s\n",
+ pr_info("[%5d] %5d %5d %8lu %8lu %7d %8lu %5hd %s\n",
task->pid, from_kuid(&init_user_ns, task_uid(task)),
task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
- task->mm->nr_ptes,
+ atomic_read(&task->mm->nr_ptes),
get_mm_counter(task->mm, MM_SWAPENTS),
task->signal->oom_score_adj, task->comm);
task_unlock(task);
--
1.8.4.rc3

2013-09-16 11:27:39

by Kirill A. Shutemov

Subject: [PATCHv2 1/9] mm: rename USE_SPLIT_PTLOCKS to USE_SPLIT_PTE_PTLOCKS

We're going to introduce a split page table lock for the PMD level.
Let's rename the existing split ptlock for the PTE level to avoid confusion.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/arm/mm/fault-armv.c | 6 +++---
arch/x86/xen/mmu.c | 6 +++---
include/linux/mm.h | 6 +++---
include/linux/mm_types.h | 8 ++++----
4 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index 2a5907b..ff379ac 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -65,7 +65,7 @@ static int do_adjust_pte(struct vm_area_struct *vma, unsigned long address,
return ret;
}

-#if USE_SPLIT_PTLOCKS
+#if USE_SPLIT_PTE_PTLOCKS
/*
* If we are using split PTE locks, then we need to take the page
* lock here. Otherwise we are using shared mm->page_table_lock
@@ -84,10 +84,10 @@ static inline void do_pte_unlock(spinlock_t *ptl)
{
spin_unlock(ptl);
}
-#else /* !USE_SPLIT_PTLOCKS */
+#else /* !USE_SPLIT_PTE_PTLOCKS */
static inline void do_pte_lock(spinlock_t *ptl) {}
static inline void do_pte_unlock(spinlock_t *ptl) {}
-#endif /* USE_SPLIT_PTLOCKS */
+#endif /* USE_SPLIT_PTE_PTLOCKS */

static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
unsigned long pfn)
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index fdc3ba2..455c873 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -796,7 +796,7 @@ static spinlock_t *xen_pte_lock(struct page *page, struct mm_struct *mm)
{
spinlock_t *ptl = NULL;

-#if USE_SPLIT_PTLOCKS
+#if USE_SPLIT_PTE_PTLOCKS
ptl = __pte_lockptr(page);
spin_lock_nest_lock(ptl, &mm->page_table_lock);
#endif
@@ -1637,7 +1637,7 @@ static inline void xen_alloc_ptpage(struct mm_struct *mm, unsigned long pfn,

__set_pfn_prot(pfn, PAGE_KERNEL_RO);

- if (level == PT_PTE && USE_SPLIT_PTLOCKS)
+ if (level == PT_PTE && USE_SPLIT_PTE_PTLOCKS)
__pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE, pfn);

xen_mc_issue(PARAVIRT_LAZY_MMU);
@@ -1671,7 +1671,7 @@ static inline void xen_release_ptpage(unsigned long pfn, unsigned level)
if (!PageHighMem(page)) {
xen_mc_batch();

- if (level == PT_PTE && USE_SPLIT_PTLOCKS)
+ if (level == PT_PTE && USE_SPLIT_PTE_PTLOCKS)
__pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, pfn);

__set_pfn_prot(pfn, PAGE_KERNEL);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b6e55e..6cf8ddb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1232,7 +1232,7 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
}
#endif /* CONFIG_MMU && !__ARCH_HAS_4LEVEL_HACK */

-#if USE_SPLIT_PTLOCKS
+#if USE_SPLIT_PTE_PTLOCKS
/*
* We tuck a spinlock to guard each pagetable page into its struct page,
* at page->private, with BUILD_BUG_ON to make sure that this will not
@@ -1245,14 +1245,14 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
} while (0)
#define pte_lock_deinit(page) ((page)->mapping = NULL)
#define pte_lockptr(mm, pmd) ({(void)(mm); __pte_lockptr(pmd_page(*(pmd)));})
-#else /* !USE_SPLIT_PTLOCKS */
+#else /* !USE_SPLIT_PTE_PTLOCKS */
/*
* We use mm->page_table_lock to guard all pagetable pages of the mm.
*/
#define pte_lock_init(page) do {} while (0)
#define pte_lock_deinit(page) do {} while (0)
#define pte_lockptr(mm, pmd) ({(void)(pmd); &(mm)->page_table_lock;})
-#endif /* USE_SPLIT_PTLOCKS */
+#endif /* USE_SPLIT_PTE_PTLOCKS */

static inline void pgtable_page_ctor(struct page *page)
{
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index faf4b7c..c6af00c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -23,7 +23,7 @@

struct address_space;

-#define USE_SPLIT_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
+#define USE_SPLIT_PTE_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)

/*
* Each physical page in the system has a struct page associated with
@@ -141,7 +141,7 @@ struct page {
* indicates order in the buddy
* system if PG_buddy is set.
*/
-#if USE_SPLIT_PTLOCKS
+#if USE_SPLIT_PTE_PTLOCKS
spinlock_t ptl;
#endif
struct kmem_cache *slab_cache; /* SL[AU]B: Pointer to slab */
@@ -309,14 +309,14 @@ enum {
NR_MM_COUNTERS
};

-#if USE_SPLIT_PTLOCKS && defined(CONFIG_MMU)
+#if USE_SPLIT_PTE_PTLOCKS && defined(CONFIG_MMU)
#define SPLIT_RSS_COUNTING
/* per-thread cached information, */
struct task_rss_stat {
int events; /* for synchronization threshold */
int count[NR_MM_COUNTERS];
};
-#endif /* USE_SPLIT_PTLOCKS */
+#endif /* USE_SPLIT_PTE_PTLOCKS */

struct mm_rss_stat {
atomic_long_t count[NR_MM_COUNTERS];
--
1.8.4.rc3

2013-09-16 11:28:04

by Kirill A. Shutemov

Subject: [PATCHv2 7/9] mm: convert the rest to new page table lock api

Only trivial cases are left. Let's convert them all together.
hugetlbfs is not covered for now.

Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 108 ++++++++++++++++++++++++++++-----------------------
mm/memory.c | 17 ++++----
mm/migrate.c | 7 ++--
mm/mprotect.c | 4 +-
mm/pgtable-generic.c | 4 +-
5 files changed, 77 insertions(+), 63 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3a1f5c1..0d85512 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -709,6 +709,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
struct page *page)
{
pgtable_t pgtable;
+ spinlock_t *ptl;

VM_BUG_ON(!PageCompound(page));
pgtable = pte_alloc_one(mm, haddr);
@@ -723,9 +724,9 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
*/
__SetPageUptodate(page);

- spin_lock(&mm->page_table_lock);
+ ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_none(*pmd))) {
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
mem_cgroup_uncharge_page(page);
put_page(page);
pte_free(mm, pgtable);
@@ -738,7 +739,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
set_pmd_at(mm, haddr, pmd, entry);
add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
atomic_inc(&mm->nr_ptes);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
}

return 0;
@@ -766,6 +767,7 @@ static inline struct page *alloc_hugepage(int defrag)
}
#endif

+/* Caller must hold page table lock. */
static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
struct page *zero_page)
@@ -797,6 +799,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (!(flags & FAULT_FLAG_WRITE) &&
transparent_hugepage_use_zero_page()) {
+ spinlock_t *ptl;
pgtable_t pgtable;
struct page *zero_page;
bool set;
@@ -809,10 +812,10 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
count_vm_event(THP_FAULT_FALLBACK);
return VM_FAULT_FALLBACK;
}
- spin_lock(&mm->page_table_lock);
+ ptl = pmd_lock(mm, pmd);
set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
zero_page);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
if (!set) {
pte_free(mm, pgtable);
put_huge_zero_page();
@@ -845,6 +848,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *vma)
{
+ spinlock_t *dst_ptl, *src_ptl;
struct page *src_page;
pmd_t pmd;
pgtable_t pgtable;
@@ -855,8 +859,9 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (unlikely(!pgtable))
goto out;

- spin_lock(&dst_mm->page_table_lock);
- spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);
+ dst_ptl = pmd_lock(dst_mm, dst_pmd);
+ src_ptl = pmd_lockptr(src_mm, src_pmd);
+ spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);

ret = -EAGAIN;
pmd = *src_pmd;
@@ -865,7 +870,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
goto out_unlock;
}
/*
- * mm->page_table_lock is enough to be sure that huge zero pmd is not
+ * When page table lock is held, the huge zero pmd should not be
* under splitting since we don't split the page itself, only pmd to
* a page table.
*/
@@ -886,8 +891,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
}
if (unlikely(pmd_trans_splitting(pmd))) {
/* split huge page running from under us */
- spin_unlock(&src_mm->page_table_lock);
- spin_unlock(&dst_mm->page_table_lock);
+ spin_unlock(src_ptl);
+ spin_unlock(dst_ptl);
pte_free(dst_mm, pgtable);

wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
@@ -907,8 +912,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,

ret = 0;
out_unlock:
- spin_unlock(&src_mm->page_table_lock);
- spin_unlock(&dst_mm->page_table_lock);
+ spin_unlock(src_ptl);
+ spin_unlock(dst_ptl);
out:
return ret;
}
@@ -919,10 +924,11 @@ void huge_pmd_set_accessed(struct mm_struct *mm,
pmd_t *pmd, pmd_t orig_pmd,
int dirty)
{
+ spinlock_t *ptl;
pmd_t entry;
unsigned long haddr;

- spin_lock(&mm->page_table_lock);
+ ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, orig_pmd)))
goto unlock;

@@ -932,13 +938,14 @@ void huge_pmd_set_accessed(struct mm_struct *mm,
update_mmu_cache_pmd(vma, address, pmd);

unlock:
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
}

static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long address,
pmd_t *pmd, pmd_t orig_pmd, unsigned long haddr)
{
+ spinlock_t *ptl;
pgtable_t pgtable;
pmd_t _pmd;
struct page *page;
@@ -965,7 +972,7 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
mmun_end = haddr + HPAGE_PMD_SIZE;
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);

- spin_lock(&mm->page_table_lock);
+ ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, orig_pmd)))
goto out_free_page;

@@ -992,7 +999,7 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
}
smp_wmb(); /* make pte visible before pmd */
pmd_populate(mm, pmd, pgtable);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
put_huge_zero_page();
inc_mm_counter(mm, MM_ANONPAGES);

@@ -1002,7 +1009,7 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
out:
return ret;
out_free_page:
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
mem_cgroup_uncharge_page(page);
put_page(page);
@@ -1016,6 +1023,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
struct page *page,
unsigned long haddr)
{
+ spinlock_t *ptl;
pgtable_t pgtable;
pmd_t _pmd;
int ret = 0, i;
@@ -1062,7 +1070,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
mmun_end = haddr + HPAGE_PMD_SIZE;
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);

- spin_lock(&mm->page_table_lock);
+ ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, orig_pmd)))
goto out_free_pages;
VM_BUG_ON(!PageHead(page));
@@ -1088,7 +1096,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
smp_wmb(); /* make pte visible before pmd */
pmd_populate(mm, pmd, pgtable);
page_remove_rmap(page);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);

mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

@@ -1099,7 +1107,7 @@ out:
return ret;

out_free_pages:
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
mem_cgroup_uncharge_start();
for (i = 0; i < HPAGE_PMD_NR; i++) {
@@ -1114,17 +1122,19 @@ out_free_pages:
int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
{
+ spinlock_t *ptl;
int ret = 0;
struct page *page = NULL, *new_page;
unsigned long haddr;
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */

+ ptl = pmd_lockptr(mm, pmd);
VM_BUG_ON(!vma->anon_vma);
haddr = address & HPAGE_PMD_MASK;
if (is_huge_zero_pmd(orig_pmd))
goto alloc;
- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
if (unlikely(!pmd_same(*pmd, orig_pmd)))
goto out_unlock;

@@ -1140,7 +1150,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto out_unlock;
}
get_page(page);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
alloc:
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow())
@@ -1187,11 +1197,11 @@ alloc:
mmun_end = haddr + HPAGE_PMD_SIZE;
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);

- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
if (page)
put_page(page);
if (unlikely(!pmd_same(*pmd, orig_pmd))) {
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
mem_cgroup_uncharge_page(new_page);
put_page(new_page);
goto out_mn;
@@ -1213,13 +1223,13 @@ alloc:
}
ret |= VM_FAULT_WRITE;
}
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
out_mn:
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
out:
return ret;
out_unlock:
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
return ret;
}

@@ -1231,7 +1241,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
struct page *page = NULL;

- assert_spin_locked(&mm->page_table_lock);
+ assert_spin_locked(pmd_lockptr(mm, pmd));

if (flags & FOLL_WRITE && !pmd_write(*pmd))
goto out;
@@ -1278,13 +1288,14 @@ out:
int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pmd_t pmd, pmd_t *pmdp)
{
+ spinlock_t *ptl;
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
int target_nid;
int current_nid = -1;
bool migrated;

- spin_lock(&mm->page_table_lock);
+ ptl = pmd_lock(mm, pmdp);
if (unlikely(!pmd_same(pmd, *pmdp)))
goto out_unlock;

@@ -1302,17 +1313,17 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}

/* Acquire the page lock to serialise THP migrations */
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
lock_page(page);

/* Confirm the PTE did not while locked */
- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
if (unlikely(!pmd_same(pmd, *pmdp))) {
unlock_page(page);
put_page(page);
goto out_unlock;
}
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);

/* Migrate the THP to the requested node */
migrated = migrate_misplaced_transhuge_page(mm, vma,
@@ -1324,7 +1335,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;

check_same:
- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
if (unlikely(!pmd_same(pmd, *pmdp)))
goto out_unlock;
clear_pmdnuma:
@@ -1333,7 +1344,7 @@ clear_pmdnuma:
VM_BUG_ON(pmd_numa(*pmdp));
update_mmu_cache_pmd(vma, addr, pmdp);
out_unlock:
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
if (current_nid != -1)
task_numa_fault(current_nid, HPAGE_PMD_NR, false);
return 0;
@@ -2282,7 +2293,7 @@ static void collapse_huge_page(struct mm_struct *mm,
pte_t *pte;
pgtable_t pgtable;
struct page *new_page;
- spinlock_t *ptl;
+ spinlock_t *pmd_ptl, *pte_ptl;
int isolated;
unsigned long hstart, hend;
unsigned long mmun_start; /* For mmu_notifiers */
@@ -2325,12 +2336,12 @@ static void collapse_huge_page(struct mm_struct *mm,
anon_vma_lock_write(vma->anon_vma);

pte = pte_offset_map(pmd, address);
- ptl = pte_lockptr(mm, pmd);
+ pte_ptl = pte_lockptr(mm, pmd);

mmun_start = address;
mmun_end = address + HPAGE_PMD_SIZE;
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
- spin_lock(&mm->page_table_lock); /* probably unnecessary */
+ pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
/*
* After this gup_fast can't run anymore. This also removes
* any huge TLB entry from the CPU so we won't allow
@@ -2338,16 +2349,16 @@ static void collapse_huge_page(struct mm_struct *mm,
* to avoid the risk of CPU bugs in that area.
*/
_pmd = pmdp_clear_flush(vma, address, pmd);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(pmd_ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

- spin_lock(ptl);
+ spin_lock(pte_ptl);
isolated = __collapse_huge_page_isolate(vma, address, pte);
- spin_unlock(ptl);
+ spin_unlock(pte_ptl);

if (unlikely(!isolated)) {
pte_unmap(pte);
- spin_lock(&mm->page_table_lock);
+ spin_lock(pmd_ptl);
BUG_ON(!pmd_none(*pmd));
/*
* We can only use set_pmd_at when establishing
@@ -2355,7 +2366,7 @@ static void collapse_huge_page(struct mm_struct *mm,
* points to regular pagetables. Use pmd_populate for that
*/
pmd_populate(mm, pmd, pmd_pgtable(_pmd));
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(pmd_ptl);
anon_vma_unlock_write(vma->anon_vma);
goto out;
}
@@ -2366,7 +2377,7 @@ static void collapse_huge_page(struct mm_struct *mm,
*/
anon_vma_unlock_write(vma->anon_vma);

- __collapse_huge_page_copy(pte, new_page, vma, address, ptl);
+ __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl);
pte_unmap(pte);
__SetPageUptodate(new_page);
pgtable = pmd_pgtable(_pmd);
@@ -2381,13 +2392,13 @@ static void collapse_huge_page(struct mm_struct *mm,
*/
smp_wmb();

- spin_lock(&mm->page_table_lock);
+ spin_lock(pmd_ptl);
BUG_ON(!pmd_none(*pmd));
page_add_new_anon_rmap(new_page, vma, address);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, address, pmd, _pmd);
update_mmu_cache_pmd(vma, address, pmd);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(pmd_ptl);

*hpage = NULL;

@@ -2712,6 +2723,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmd)
{
+ spinlock_t *ptl;
struct page *page;
struct mm_struct *mm = vma->vm_mm;
unsigned long haddr = address & HPAGE_PMD_MASK;
@@ -2723,22 +2735,22 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
mmun_start = haddr;
mmun_end = haddr + HPAGE_PMD_SIZE;
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
- spin_lock(&mm->page_table_lock);
+ ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_trans_huge(*pmd))) {
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
return;
}
if (is_huge_zero_pmd(*pmd)) {
__split_huge_zero_page_pmd(vma, haddr, pmd);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
return;
}
page = pmd_page(*pmd);
VM_BUG_ON(!page_count(page));
get_page(page);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

split_huge_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index 1046396..551b15e3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -552,6 +552,7 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
pmd_t *pmd, unsigned long address)
{
+ spinlock_t *ptl;
pgtable_t new = pte_alloc_one(mm, address);
int wait_split_huge_page;
if (!new)
@@ -572,7 +573,7 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
*/
smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */

- spin_lock(&mm->page_table_lock);
+ ptl = pmd_lock(mm, pmd);
wait_split_huge_page = 0;
if (likely(pmd_none(*pmd))) { /* Has another populated it ? */
atomic_inc(&mm->nr_ptes);
@@ -580,7 +581,7 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
new = NULL;
} else if (unlikely(pmd_trans_splitting(*pmd)))
wait_split_huge_page = 1;
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
if (new)
pte_free(mm, new);
if (wait_split_huge_page)
@@ -1516,20 +1517,20 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
split_huge_page_pmd(vma, address, pmd);
goto split_fallthrough;
}
- spin_lock(&mm->page_table_lock);
+ ptl = pmd_lock(mm, pmd);
if (likely(pmd_trans_huge(*pmd))) {
if (unlikely(pmd_trans_splitting(*pmd))) {
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
wait_split_huge_page(vma->anon_vma, pmd);
} else {
page = follow_trans_huge_pmd(vma, address,
pmd, flags);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
*page_mask = HPAGE_PMD_NR - 1;
goto out;
}
} else
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
/* fall through */
}
split_fallthrough:
@@ -3602,13 +3603,13 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
bool numa = false;
int local_nid = numa_node_id();

- spin_lock(&mm->page_table_lock);
+ ptl = pmd_lock(mm, pmdp);
pmd = *pmdp;
if (pmd_numa(pmd)) {
set_pmd_at(mm, _addr, pmdp, pmd_mknonnuma(pmd));
numa = true;
}
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);

if (!numa)
return 0;
diff --git a/mm/migrate.c b/mm/migrate.c
index b7ded7e..399d831 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1653,6 +1653,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
unsigned long address,
struct page *page, int node)
{
+ spinlock_t *ptl;
unsigned long haddr = address & HPAGE_PMD_MASK;
pg_data_t *pgdat = NODE_DATA(node);
int isolated = 0;
@@ -1699,9 +1700,9 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
WARN_ON(PageLRU(new_page));

/* Recheck the target PMD */
- spin_lock(&mm->page_table_lock);
+ ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_same(*pmd, entry))) {
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);

/* Reverse changes made by migrate_page_copy() */
if (TestClearPageActive(new_page))
@@ -1746,7 +1747,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
* before it's fully transferred to the new page.
*/
mem_cgroup_end_migration(memcg, page, new_page, true);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);

unlock_page(new_page);
unlock_page(page);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94722a4..d01a535 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -116,9 +116,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
pmd_t *pmd)
{
- spin_lock(&mm->page_table_lock);
+ spinlock_t *ptl = pmd_lock(mm, pmd);
set_pmd_at(mm, addr & PMD_MASK, pmd, pmd_mknuma(*pmd));
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
}
#else
static inline void change_pmd_protnuma(struct mm_struct *mm, unsigned long addr,
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 41fee3e..cbb3854 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -151,7 +151,7 @@ void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
pgtable_t pgtable)
{
- assert_spin_locked(&mm->page_table_lock);
+ assert_spin_locked(pmd_lockptr(mm, pmdp));

/* FIFO */
if (!pmd_huge_pte(mm, pmdp))
@@ -170,7 +170,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
{
pgtable_t pgtable;

- assert_spin_locked(&mm->page_table_lock);
+ assert_spin_locked(pmd_lockptr(mm, pmdp));

/* FIFO */
pgtable = pmd_huge_pte(mm, pmdp);
--
1.8.4.rc3

2013-09-16 11:44:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCHv2 0/9] split page table lock for PMD tables

> - reuse CONFIG_SPLIT_PTLOCK_CPUS for PMD split lock;

So why is there still CONFIG_SPLIT_PTE_PTLOCK and
CONFIG_SPLIT_PMD_PTLOCK in the series?

2013-09-16 12:12:10

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2 0/9] split page table lock for PMD tables

Peter Zijlstra wrote:
> > - reuse CONFIG_SPLIT_PTLOCK_CPUS for PMD split lock;
>
> So why is there still CONFIG_SPLIT_PTE_PTLOCK and
> CONFIG_SPLIT_PMD_PTLOCK in the series?

There is not.

We have USE_SPLIT_{PTE,PMD}_PTLOCKS. We can't use the split pmd lock everywhere
the split pte lock is available: it requires separate enabling on the arch side
and will not work on all hardware (e.g. with a 2-level page table
configuration).
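
To make the discussion easier to follow, here is a rough sketch of the
PMD-lock API the series introduces and that the conversions rely on. It is
illustrative only, not the exact patch; in particular, where the lock is
stored inside struct page is an assumption here:

/*
 * pmd_lockptr()/pmd_lock() return the lock protecting a given PMD table.
 * With USE_SPLIT_PMD_PTLOCKS the lock lives in the struct page of the PMD
 * page table; otherwise we fall back to the old per-mm page_table_lock,
 * so unconverted configs keep the previous behaviour.
 */
#if USE_SPLIT_PMD_PTLOCKS
static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
	return &virt_to_page(pmd)->ptl;	/* assumed location of the lock */
}
#else
static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
{
	return &mm->page_table_lock;
}
#endif

static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
{
	spinlock_t *ptl = pmd_lockptr(mm, pmd);
	spin_lock(ptl);
	return ptl;
}

Callers take whichever lock pmd_lock() hands back and release it with
plain spin_unlock(), as in the hunks above.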

--
Kirill A. Shutemov

2013-09-19 17:16:58

by Alex Thorlton

[permalink] [raw]
Subject: Re: [PATCHv2 0/9] split page table lock for PMD tables

On Mon, Sep 16, 2013 at 02:25:31PM +0300, Kirill A. Shutemov wrote:
> Alex Thorlton noticed that some massivly threaded workloads work poorly,
> if THP enabled. This patchset fixes this by introducing split page table
> lock for PMD tables. hugetlbfs is not covered yet.
>
> This patchset is based on work by Naoya Horiguchi.
>
> Changes:
> v2:
> - reuse CONFIG_SPLIT_PTLOCK_CPUS for PMD split lock;
> - s/huge_pmd_lock/pmd_lock/g;
> - assume pgtable_pmd_page_ctor() can fail;
> - fix format line in task_mem() for VmPTE;
>
> Benchmark (from Alex): ftp://shell.sgi.com/collect/appsx_test/pthread_test.tar.gz
> Run on 4 socket Westmere with 128 GiB of RAM.

Kirill,

I'm hitting some performance issues with these patches on our larger
machines (>=128 cores/256 threads). I've managed to livelock larger
systems with one of our tests (I'll publish this one soon), and I'm
actually seeing a performance hit on some of the smaller ones. I'm
currently collecting some results to show the problems I'm hitting,
and trying to research what's causing the livelock. For now I just
wanted to let you know that I'm seeing some issues. I'll be in touch
with more details.

Sorry for the delayed response; I wanted to be really sure that I was
seeing problems before I put up a red flag.

- Alex

2013-09-20 12:32:03

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2 0/9] split page table lock for PMD tables

Alex Thorlton wrote:
> On Mon, Sep 16, 2013 at 02:25:31PM +0300, Kirill A. Shutemov wrote:
> > Alex Thorlton noticed that some massivly threaded workloads work poorly,
> > if THP enabled. This patchset fixes this by introducing split page table
> > lock for PMD tables. hugetlbfs is not covered yet.
> >
> > This patchset is based on work by Naoya Horiguchi.
> >
> > Changes:
> > v2:
> > - reuse CONFIG_SPLIT_PTLOCK_CPUS for PMD split lock;
> > - s/huge_pmd_lock/pmd_lock/g;
> > - assume pgtable_pmd_page_ctor() can fail;
> > - fix format line in task_mem() for VmPTE;
> >
> > Benchmark (from Alex): ftp://shell.sgi.com/collect/appsx_test/pthread_test.tar.gz
> > Run on 4 socket Westmere with 128 GiB of RAM.
>
> Kirill,
>
> I'm hitting some performance issues with these patches on our larger
> machines (>=128 cores/256 threads). I've managed to livelock larger
> systems with one of our tests (I'll publish this one soon), and I'm
> actually seeing a performance hit on some of the smaller ones.

Does "performance hit" mean performance degradation?

> I'm currently collecting some results to show the problems I'm hitting, and
> trying to research what's causing the livelock. For now I just wanted to let
> you know that I'm seeing some issues. I'll be in touch with more details.

Looking forward to more details. Thanks for testing.

I'll try to find a bigger machine myself.

--
Kirill A. Shutemov

2013-09-24 16:44:49

by Alex Thorlton

[permalink] [raw]
Subject: Re: [PATCHv2 0/9] split page table lock for PMD tables

Kirill,

I've merged a couple of our e-mails together for this reply:

On Fri, Sep 20, 2013 at 03:31:37PM +0300, Kirill A. Shutemov wrote:
> Alex Thorlton noticed that some massivly threaded workloads work poorly,
> if THP enabled. This patchset fixes this by introducing split page table
> lock for PMD tables. hugetlbfs is not covered yet.
>
> This patchset is based on work by Naoya Horiguchi.
>
> Changes:
> v2:
> - reuse CONFIG_SPLIT_PTLOCK_CPUS for PMD split lock;
> - s/huge_pmd_lock/pmd_lock/g;
> - assume pgtable_pmd_page_ctor() can fail;
> - fix format line in task_mem() for VmPTE;
>
> Benchmark (from Alex): ftp://shell.sgi.com/collect/appsx_test/pthread_test.tar.gz
> Run on 4 socket Westmere with 128 GiB of RAM.

First off, this test, although still somewhat relevant here, wasn't
exactly crafted to show the particular page fault scaling problem that
I was looking to address by changing to the split PTLs. I wrote this
one to show what happens when an application allocates memory in such
a way that all of its pages get stuck on one node due to the use of
THP. This problem is much more evident on large machines, although
it appears to still be a significant issue even on this 4-socket system.

> THP off:
> --------
>
> Performance counter stats for './thp_pthread -C 0 -m 0 -c 80 -b 100g' (5 runs):
>
> 1738259.808012 task-clock # 47.571 CPUs utilized ( +- 9.49% )
> 147,359 context-switches # 0.085 K/sec ( +- 9.67% )
> 14 cpu-migrations # 0.000 K/sec ( +- 13.25% )
> 24,410,139 page-faults # 0.014 M/sec ( +- 0.00% )
> 4,149,037,526,252 cycles # 2.387 GHz ( +- 9.50% )
> 3,649,839,735,027 stalled-cycles-frontend # 87.97% frontend cycles idle ( +- 6.60% )
> 2,455,558,969,567 stalled-cycles-backend # 59.18% backend cycles idle ( +- 22.92% )
> 1,434,961,518,604 instructions # 0.35 insns per cycle
> # 2.54 stalled cycles per insn ( +- 92.86% )
> 241,472,020,951 branches # 138.916 M/sec ( +- 91.72% )
> 84,022,172 branch-misses # 0.03% of all branches ( +- 3.16% )
>
> 36.540185552 seconds time elapsed ( +- 18.36% )

I'm assuming this was THP off, no patchset, correct?

> THP on, no patchset:
> --------------------
> Performance counter stats for './thp_pthread -C 0 -m 0 -c 80 -b 100g' (5 runs):
>
> 2528378.966949 task-clock # 50.715 CPUs utilized ( +- 11.86% )
> 214,063 context-switches # 0.085 K/sec ( +- 11.94% )
> 19 cpu-migrations # 0.000 K/sec ( +- 22.72% )
> 49,226 page-faults # 0.019 K/sec ( +- 0.33% )
> 6,034,640,598,498 cycles # 2.387 GHz ( +- 11.91% )
> 5,685,933,794,081 stalled-cycles-frontend # 94.22% frontend cycles idle ( +- 7.67% )
> 4,414,381,393,353 stalled-cycles-backend # 73.15% backend cycles idle ( +- 2.09% )
> 952,086,804,776 instructions # 0.16 insns per cycle
> # 5.97 stalled cycles per insn ( +- 89.59% )
> 166,191,211,974 branches # 65.730 M/sec ( +- 85.52% )
> 33,341,022 branch-misses # 0.02% of all branches ( +- 3.90% )
>
> 49.854741504 seconds time elapsed ( +- 14.76% )
>
> THP on, with patchset:
> ----------------------
>
> echo always > /sys/kernel/mm/transparent_hugepage/enabled
> Performance counter stats for './thp_pthread -C 0 -m 0 -c 80 -b 100g' (5 runs):
>
> 1538763.343568 task-clock # 45.386 CPUs utilized ( +- 7.21% )
> 130,469 context-switches # 0.085 K/sec ( +- 7.32% )
> 14 cpu-migrations # 0.000 K/sec ( +- 23.58% )
> 49,299 page-faults # 0.032 K/sec ( +- 0.15% )
> 3,666,748,502,650 cycles # 2.383 GHz ( +- 7.25% )
> 3,330,488,035,212 stalled-cycles-frontend # 90.83% frontend cycles idle ( +- 4.70% )
> 2,383,357,073,990 stalled-cycles-backend # 65.00% backend cycles idle ( +- 16.06% )
> 935,504,610,528 instructions # 0.26 insns per cycle
> # 3.56 stalled cycles per insn ( +- 91.16% )
> 161,466,689,532 branches # 104.933 M/sec ( +- 87.67% )
> 22,602,225 branch-misses # 0.01% of all branches ( +- 6.43% )
>
> 33.903917543 seconds time elapsed ( +- 12.57% )

These results all make sense, although I think that the granularity of
the locking here is only part of the issue. I can see that increasing
the granularity has gotten us a solid performance boost, but I think
we're still seeing the problem that this test was originally written to
cause. Can you run it with THP off, with the patchset, and see what
those results look like? In theory, they should be the same as the THP
off, no patchset, but I'd just like to confirm.

Alex Thorlton wrote:
> > Kirill,
> >
> > I'm hitting some performance issues with these patches on our larger
> > machines (>=128 cores/256 threads). I've managed to livelock larger
> > systems with one of our tests (I'll publish this one soon), and I'm
> > actually seeing a performance hit on some of the smaller ones.
>
> Does "performance hit" mean performance degradation?

Yes, sorry if I wasn't clear about that. After some more thorough
testing, though, I think the one test where I saw a performance
degradation was just a fluke. It looks like we are getting a bit of a
performance increase here.

Below are the results from a test that we use to get an idea of how
well page fault rates scale with threaded applications. I've
included a pointer to the code after the results.

Here are my results from this test on 3.12-rc1:

Performance counter stats for './runt -t -c 512 -b 512m' (5 runs):

601876434.115594 task-clock # 528.537 CPUs utilized ( +- 0.47% ) [100.00%]
1623458 context-switches # 0.000 M/sec ( +- 1.53% ) [100.00%]
1516 CPU-migrations # 0.000 M/sec ( +- 2.77% ) [100.00%]
384463 page-faults # 0.000 M/sec ( +- 2.67% )
1379590734249907 cycles # 2.292 GHz ( +- 0.44% ) [100.00%]
1341758692269274 stalled-cycles-frontend # 97.26% frontend cycles idle ( +- 0.40% ) [100.00%]
1221638993634458 stalled-cycles-backend # 88.55% backend cycles idle ( +- 0.40% ) [100.00%]
84690378275421 instructions # 0.06 insns per cycle
# 15.84 stalled cycles per insn ( +- 8.33% ) [100.00%]
31688643150502 branches # 52.650 M/sec ( +- 7.41% ) [100.00%]
596215444 branch-misses # 0.00% of all branches ( +- 1.29% )

1138.759708820 seconds time elapsed ( +- 0.47% )

And the same test on 3.12-rc1 with your patchset:

Performance counter stats for './runt -t -c 512 -b 512m' (5 runs):

589443340.500156 task-clock # 528.547 CPUs utilized ( +- 0.18% ) [100.00%]
1581275 context-switches # 0.000 M/sec ( +- 0.21% ) [100.00%]
1396 CPU-migrations # 0.000 M/sec ( +- 1.03% ) [100.00%]
396751 page-faults # 0.000 M/sec ( +- 1.00% )
1349678173115306 cycles # 2.290 GHz ( +- 0.19% ) [100.00%]
1312632685809142 stalled-cycles-frontend # 97.26% frontend cycles idle ( +- 0.16% ) [100.00%]
1195309877165326 stalled-cycles-backend # 88.56% backend cycles idle ( +- 0.15% ) [100.00%]
86399944669187 instructions # 0.06 insns per cycle
# 15.19 stalled cycles per insn ( +- 3.67% ) [100.00%]
32155627498040 branches # 54.553 M/sec ( +- 3.29% ) [100.00%]
571881894 branch-misses # 0.00% of all branches ( +- 0.85% )

1115.214191126 seconds time elapsed ( +- 0.18% )

Looks like we're getting a mild performance increase here, but we still
have a problem. Here are the results from my *very* primitive attempt
at something similar to your patches:

Performance counter stats for './runt -t -c 512 -b 512m' (5 runs):

29789210.870386 task-clock # 531.115 CPUs utilized ( +- 2.89% ) [100.00%]
116533 context-switches # 0.000 M/sec ( +- 9.63% ) [100.00%]
1035 CPU-migrations # 0.000 M/sec ( +- 1.31% ) [100.00%]
348985 page-faults # 0.000 M/sec ( +- 2.17% )
46809677863909 cycles # 1.571 GHz ( +- 4.17% ) [100.00%]
37484346945854 stalled-cycles-frontend # 80.08% frontend cycles idle ( +- 5.20% ) [100.00%]
34441194088742 stalled-cycles-backend # 73.58% backend cycles idle ( +- 5.84% ) [100.00%]
43467215183285 instructions # 0.93 insns per cycle
# 0.86 stalled cycles per insn ( +- 3.45% ) [100.00%]
14551423412527 branches # 488.480 M/sec ( +- 3.45% ) [100.00%]
141577229 branch-misses # 0.00% of all branches ( +- 2.27% )

56.088059717 seconds time elapsed ( +- 2.90% )

Now, I know that my patch doesn't perform proper locking here.
Actually, all I've done is narrow down the one lock where the test was
spending most of its time, and switch that over to the PUD split-PTL.
You can see the diff for that here:

http://www.spinics.net/lists/kernel/msg1594736.html

Here's a pointer to the test that I used to get these results:

ftp://shell.sgi.com/collect/memscale/thp_memscale.tar.gz

And a brief description of how the test works:

- Once all threads are ready they are turned loose to perform the
  following steps concurrently:
    + Reserve a chunk of memory local to each thread's CPU (-b option)
    + Write to the first byte of each 4k chunk of the memory the thread
      just allocated
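
For reference, the per-thread work boils down to something like the
following sketch (illustrative only; the real code is in the tarball
below, and the names, error handling and 4k constant here are mine):

#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

#define STEP 4096UL	/* touch one byte per 4k chunk */

static void *worker(void *arg)
{
	size_t bytes = *(size_t *)arg;	/* per-thread size from -b */
	size_t off;
	char *buf;

	/* Reserve the chunk; the first-touch loop below faults the pages
	 * in locally to the CPU the thread is bound to. */
	buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return NULL;

	/* One write per 4k chunk: every thread takes page faults against
	 * the same mm at the same time, which is what hammers the page
	 * table lock(s). */
	for (off = 0; off < bytes; off += STEP)
		buf[off] = 1;

	return NULL;
}

With THP enabled those faults land in the huge page fault path, so the
lock taken there is what the numbers above end up measuring.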

Looking at these results, I think we can do better here, but I need to
really spend some time figuring out exactly what's causing the
difference in performance between the little tweak that I made, and the
full-fledged patches you've got here.

From what I can tell, we're getting hung up on the same lock that we were
originally getting stuck on (with this test, at least); "why" is a
different story.

I'm going to keep looking into this. Let me know if anybody has any
ideas!

- Alex

2013-09-26 10:51:00

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2 0/9] split page table lock for PMD tables

Alex Thorlton wrote:
> > THP off:
> > --------
...
> > 36.540185552 seconds time elapsed ( +- 18.36% )
>
> I'm assuming this was THP off, no patchset, correct?

Yes. But THP off patched is *very* close to this, so I didn't post it separately.

> Here are my results from this test on 3.12-rc1:
...
> 1138.759708820 seconds time elapsed ( +- 0.47% )
>
> And the same test on 3.12-rc1 with your patchset:
>
> Performance counter stats for './runt -t -c 512 -b 512m' (5 runs):
...
> 1115.214191126 seconds time elapsed ( +- 0.18% )
>
> Looks like we're getting a mild performance increase here, but we still
> have a problem.

Let me guess: you have HUGETLBFS enabled in your config, right? ;)

HUGETLBFS hasn't converted to new locking and we disable split pmd lock if
HUGETLBFS is enabled.

I'm going to convert HUGETLBFS too, but it might take some time.
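
For reference, this is the guard in question, as it stands with this
series applied (indentation here is mine); converting hugetlbfs means
dropping the !IS_ENABLED(CONFIG_HUGETLB_PAGE) check:

#define USE_SPLIT_PTE_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
/* hugetlb hasn't converted to split locking yet */
#define USE_SPLIT_PMD_PTLOCKS	(USE_SPLIT_PTE_PTLOCKS && \
		IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK) && \
		!IS_ENABLED(CONFIG_HUGETLB_PAGE))

So with HUGETLBFS in the config, every pmd_lock() quietly falls back to
mm->page_table_lock.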

Without HUGETLBFS, the numbers look pretty solid on my machine:

THP off, v3.12-rc2:
-------------------

Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):

1037072.835207 task-clock # 57.426 CPUs utilized ( +- 3.59% )
95,093 context-switches # 0.092 K/sec ( +- 3.93% )
140 cpu-migrations # 0.000 K/sec ( +- 5.28% )
10,000,550 page-faults # 0.010 M/sec ( +- 0.00% )
2,455,210,400,261 cycles # 2.367 GHz ( +- 3.62% ) [83.33%]
2,429,281,882,056 stalled-cycles-frontend # 98.94% frontend cycles idle ( +- 3.67% ) [83.33%]
1,975,960,019,659 stalled-cycles-backend # 80.48% backend cycles idle ( +- 3.88% ) [66.68%]
46,503,296,013 instructions # 0.02 insns per cycle
# 52.24 stalled cycles per insn ( +- 3.21% ) [83.34%]
9,278,997,542 branches # 8.947 M/sec ( +- 4.00% ) [83.34%]
89,881,640 branch-misses # 0.97% of all branches ( +- 1.17% ) [83.33%]

18.059261877 seconds time elapsed ( +- 2.65% )

THP on, v3.12-rc2:
------------------

Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):

3114745.395974 task-clock # 73.875 CPUs utilized ( +- 1.84% )
267,356 context-switches # 0.086 K/sec ( +- 1.84% )
99 cpu-migrations # 0.000 K/sec ( +- 1.40% )
58,313 page-faults # 0.019 K/sec ( +- 0.28% )
7,416,635,817,510 cycles # 2.381 GHz ( +- 1.83% ) [83.33%]
7,342,619,196,993 stalled-cycles-frontend # 99.00% frontend cycles idle ( +- 1.88% ) [83.33%]
6,267,671,641,967 stalled-cycles-backend # 84.51% backend cycles idle ( +- 2.03% ) [66.67%]
117,819,935,165 instructions # 0.02 insns per cycle
# 62.32 stalled cycles per insn ( +- 4.39% ) [83.34%]
28,899,314,777 branches # 9.278 M/sec ( +- 4.48% ) [83.34%]
71,787,032 branch-misses # 0.25% of all branches ( +- 1.03% ) [83.33%]

42.162306788 seconds time elapsed ( +- 1.73% )

THP off, patched, no HUGETLBFS:
-------------------------------

Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):

943301.957892 task-clock # 56.256 CPUs utilized ( +- 3.01% )
86,218 context-switches # 0.091 K/sec ( +- 3.17% )
121 cpu-migrations # 0.000 K/sec ( +- 6.64% )
10,000,551 page-faults # 0.011 M/sec ( +- 0.00% )
2,230,462,457,654 cycles # 2.365 GHz ( +- 3.04% ) [83.32%]
2,204,616,385,805 stalled-cycles-frontend # 98.84% frontend cycles idle ( +- 3.09% ) [83.32%]
1,778,640,046,926 stalled-cycles-backend # 79.74% backend cycles idle ( +- 3.47% ) [66.69%]
45,995,472,617 instructions # 0.02 insns per cycle
# 47.93 stalled cycles per insn ( +- 2.51% ) [83.34%]
9,179,700,174 branches # 9.731 M/sec ( +- 3.04% ) [83.35%]
89,166,529 branch-misses # 0.97% of all branches ( +- 1.45% ) [83.33%]

16.768027318 seconds time elapsed ( +- 2.47% )

THP on, patched, no HUGETLBFS:
------------------------------

Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):

458793.837905 task-clock # 54.632 CPUs utilized ( +- 0.79% )
41,831 context-switches # 0.091 K/sec ( +- 0.97% )
98 cpu-migrations # 0.000 K/sec ( +- 1.66% )
57,829 page-faults # 0.126 K/sec ( +- 0.62% )
1,077,543,336,716 cycles # 2.349 GHz ( +- 0.81% ) [83.33%]
1,067,403,802,964 stalled-cycles-frontend # 99.06% frontend cycles idle ( +- 0.87% ) [83.33%]
864,764,616,143 stalled-cycles-backend # 80.25% backend cycles idle ( +- 0.73% ) [66.68%]
16,129,177,440 instructions # 0.01 insns per cycle
# 66.18 stalled cycles per insn ( +- 7.94% ) [83.35%]
3,618,938,569 branches # 7.888 M/sec ( +- 8.46% ) [83.36%]
33,242,032 branch-misses # 0.92% of all branches ( +- 2.02% ) [83.32%]

8.397885779 seconds time elapsed ( +- 0.18% )

--
Kirill A. Shutemov

2013-09-26 15:42:35

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2 0/9] split page table lock for PMD tables

Kirill A. Shutemov wrote:
> Alex Thorlton wrote:
> > > THP off:
> > > --------
> ...
> > > 36.540185552 seconds time elapsed ( +- 18.36% )
> >
> > I'm assuming this was THP off, no patchset, correct?
>
> Yes. But THP off patched is *very* close to this, so I didn't post it separately.
>
> > Here are my results from this test on 3.12-rc1:
> ...
> > 1138.759708820 seconds time elapsed ( +- 0.47% )
> >
> > And the same test on 3.12-rc1 with your patchset:
> >
> > Performance counter stats for './runt -t -c 512 -b 512m' (5 runs):
> ...
> > 1115.214191126 seconds time elapsed ( +- 0.18% )
> >
> > Looks like we're getting a mild performance increase here, but we still
> > have a problem.
>
> Let me guess: you have HUGETLBFS enabled in your config, right? ;)
>
> HUGETLBFS hasn't converted to new locking and we disable split pmd lock if
> HUGETLBFS is enabled.
>
> I'm going to convert HUGETLBFS too, but it might take some time.

Okay, here is a slightly reworked patch from Naoya Horiguchi.
It might need more cleanup.

Please, test and review.

From 47e400fc308e0054c2cadf7df48f632555c83572 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Thu, 26 Sep 2013 17:51:33 +0300
Subject: [PATCH] mm/hugetlb: convert hugetlbfs to use split pmd lock

Hugetlb supports multiple page sizes. We use split lock only for PMD
level, but not for PUD.

I've run workload from Alex Thorlton[1], slightly modified to use
mmap(MAP_HUGETLB) for memory allocation.

hugetlbfs, v3.12-rc2:
---------------------

Performance counter stats for './thp_memscale_hugetlbfs -c 80 -b 512M' (5 runs):

2588052.787264 task-clock # 54.400 CPUs utilized ( +- 3.69% )
246,831 context-switches # 0.095 K/sec ( +- 4.15% )
138 cpu-migrations # 0.000 K/sec ( +- 5.30% )
21,027 page-faults # 0.008 K/sec ( +- 0.01% )
6,166,666,307,263 cycles # 2.383 GHz ( +- 3.68% ) [83.33%]
6,086,008,929,407 stalled-cycles-frontend # 98.69% frontend cycles idle ( +- 3.77% ) [83.33%]
5,087,874,435,481 stalled-cycles-backend # 82.51% backend cycles idle ( +- 4.41% ) [66.67%]
133,782,831,249 instructions # 0.02 insns per cycle
# 45.49 stalled cycles per insn ( +- 4.30% ) [83.34%]
34,026,870,541 branches # 13.148 M/sec ( +- 4.24% ) [83.34%]
68,670,942 branch-misses # 0.20% of all branches ( +- 3.26% ) [83.33%]

47.574936948 seconds time elapsed ( +- 2.09% )

hugetlbfs, patched:
-------------------

Performance counter stats for './thp_memscale_hugetlbfs -c 80 -b 512M' (5 runs):

395353.076837 task-clock # 20.329 CPUs utilized ( +- 8.16% )
55,730 context-switches # 0.141 K/sec ( +- 5.31% )
138 cpu-migrations # 0.000 K/sec ( +- 4.24% )
21,027 page-faults # 0.053 K/sec ( +- 0.00% )
930,219,717,244 cycles # 2.353 GHz ( +- 8.21% ) [83.32%]
914,295,694,103 stalled-cycles-frontend # 98.29% frontend cycles idle ( +- 8.35% ) [83.33%]
704,137,950,187 stalled-cycles-backend # 75.70% backend cycles idle ( +- 9.16% ) [66.69%]
30,541,538,385 instructions # 0.03 insns per cycle
# 29.94 stalled cycles per insn ( +- 3.98% ) [83.35%]
8,415,376,631 branches # 21.286 M/sec ( +- 3.61% ) [83.36%]
32,645,478 branch-misses # 0.39% of all branches ( +- 3.41% ) [83.32%]

19.447481153 seconds time elapsed ( +- 2.00% )

Split lock helps, but hugetlbfs is still significantly slower than THP
(8.4 seconds).

[1] ftp://shell.sgi.com/collect/memscale/thp_memscale.tar.gz

Signed-off-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/proc/meminfo.c | 2 +-
include/linux/hugetlb.h | 19 +++++++++
include/linux/mm_types.h | 4 +-
include/linux/swapops.h | 9 ++--
mm/hugetlb.c | 108 ++++++++++++++++++++++++++++-------------------
mm/mempolicy.c | 5 ++-
mm/migrate.c | 7 +--
mm/rmap.c | 2 +-
8 files changed, 100 insertions(+), 56 deletions(-)

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 59d85d6088..6d061f5359 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -1,8 +1,8 @@
#include <linux/fs.h>
-#include <linux/hugetlb.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/mm.h>
+#include <linux/hugetlb.h>
#include <linux/mman.h>
#include <linux/mmzone.h>
#include <linux/proc_fs.h>
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 0393270466..4a4a73b1ec 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -401,6 +401,7 @@ struct hstate {};
#define hstate_sizelog(s) NULL
#define hstate_vma(v) NULL
#define hstate_inode(i) NULL
+#define page_hstate(page) NULL
#define huge_page_size(h) PAGE_SIZE
#define huge_page_mask(h) PAGE_MASK
#define vma_kernel_pagesize(v) PAGE_SIZE
@@ -423,4 +424,22 @@ static inline pgoff_t basepage_index(struct page *page)
#define hugepage_migration_support(h) 0
#endif /* CONFIG_HUGETLB_PAGE */

+static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
+ struct mm_struct *mm, pte_t *pte)
+{
+ if (huge_page_size(h) == PMD_SIZE)
+ return pmd_lockptr(mm, (pmd_t *) pte);
+ VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
+ return &mm->page_table_lock;
+}
+
+static inline spinlock_t *huge_pte_lock(struct hstate *h,
+ struct mm_struct *mm, pte_t *pte)
+{
+ spinlock_t *ptl;
+ ptl = huge_pte_lockptr(h, mm, pte);
+ spin_lock(ptl);
+ return ptl;
+}
+
#endif /* _LINUX_HUGETLB_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8c2a3c3e28..90cc93b2cc 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -24,10 +24,8 @@
struct address_space;

#define USE_SPLIT_PTE_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
-/* hugetlb hasn't converted to split locking yet */
#define USE_SPLIT_PMD_PTLOCKS (USE_SPLIT_PTE_PTLOCKS && \
- IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK) && \
- !IS_ENABLED(CONFIG_HUGETLB_PAGE))
+ IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK))

/*
* Each physical page in the system has a struct page associated with
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 8d4fa82bfb..ad9f4b2964 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -102,6 +102,8 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
}

+struct hstate;
+
#ifdef CONFIG_MIGRATION
static inline swp_entry_t make_migration_entry(struct page *page, int write)
{
@@ -139,7 +141,8 @@ static inline void make_migration_entry_read(swp_entry_t *entry)

extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
unsigned long address);
-extern void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte);
+extern void migration_entry_wait_huge(struct hstate *hstate,
+ struct mm_struct *mm, pte_t *pte);
#else

#define make_migration_entry(page, write) swp_entry(0, 0)
@@ -151,8 +154,8 @@ static inline int is_migration_entry(swp_entry_t swp)
static inline void make_migration_entry_read(swp_entry_t *entryp) { }
static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
unsigned long address) { }
-static inline void migration_entry_wait_huge(struct mm_struct *mm,
- pte_t *pte) { }
+static inline void migration_entry_wait_huge(struct hstate *hstate,
+ struct mm_struct *mm, pte_t *pte) { }
static inline int is_write_migration_entry(swp_entry_t entry)
{
return 0;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b49579c7f2..50cdeb6cb0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2361,6 +2361,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;

for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
+ spinlock_t *src_ptl, *dst_ptl;
src_pte = huge_pte_offset(src, addr);
if (!src_pte)
continue;
@@ -2372,8 +2373,9 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
if (dst_pte == src_pte)
continue;

- spin_lock(&dst->page_table_lock);
- spin_lock_nested(&src->page_table_lock, SINGLE_DEPTH_NESTING);
+ dst_ptl = huge_pte_lock(h, dst, dst_pte);
+ src_ptl = huge_pte_lockptr(h, src, src_pte);
+ spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
if (!huge_pte_none(huge_ptep_get(src_pte))) {
if (cow)
huge_ptep_set_wrprotect(src, addr, src_pte);
@@ -2383,8 +2385,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
page_dup_rmap(ptepage);
set_huge_pte_at(dst, addr, dst_pte, entry);
}
- spin_unlock(&src->page_table_lock);
- spin_unlock(&dst->page_table_lock);
+ spin_unlock(src_ptl);
+ spin_unlock(dst_ptl);
}
return 0;

@@ -2427,6 +2429,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
unsigned long address;
pte_t *ptep;
pte_t pte;
+ spinlock_t *ptl;
struct page *page;
struct hstate *h = hstate_vma(vma);
unsigned long sz = huge_page_size(h);
@@ -2440,25 +2443,25 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
tlb_start_vma(tlb, vma);
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
again:
- spin_lock(&mm->page_table_lock);
for (address = start; address < end; address += sz) {
ptep = huge_pte_offset(mm, address);
if (!ptep)
continue;

+ ptl = huge_pte_lock(h, mm, ptep);
if (huge_pmd_unshare(mm, &address, ptep))
- continue;
+ goto unlock;

pte = huge_ptep_get(ptep);
if (huge_pte_none(pte))
- continue;
+ goto unlock;

/*
* HWPoisoned hugepage is already unmapped and dropped reference
*/
if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
huge_pte_clear(mm, address, ptep);
- continue;
+ goto unlock;
}

page = pte_page(pte);
@@ -2469,7 +2472,7 @@ again:
*/
if (ref_page) {
if (page != ref_page)
- continue;
+ goto unlock;

/*
* Mark the VMA as having unmapped its page so that
@@ -2486,13 +2489,18 @@ again:

page_remove_rmap(page);
force_flush = !__tlb_remove_page(tlb, page);
- if (force_flush)
+ if (force_flush) {
+ spin_unlock(ptl);
break;
+ }
/* Bail out after unmapping reference page if supplied */
- if (ref_page)
+ if (ref_page) {
+ spin_unlock(ptl);
break;
+ }
+unlock:
+ spin_unlock(ptl);
}
- spin_unlock(&mm->page_table_lock);
/*
* mmu_gather ran out of room to batch pages, we break out of
* the PTE lock to avoid doing the potential expensive TLB invalidate
@@ -2598,7 +2606,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
*/
static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *ptep, pte_t pte,
- struct page *pagecache_page)
+ struct page *pagecache_page, spinlock_t *ptl)
{
struct hstate *h = hstate_vma(vma);
struct page *old_page, *new_page;
@@ -2632,8 +2640,8 @@ retry_avoidcopy:

page_cache_get(old_page);

- /* Drop page_table_lock as buddy allocator may be called */
- spin_unlock(&mm->page_table_lock);
+ /* Drop page table lock as buddy allocator may be called */
+ spin_unlock(ptl);
new_page = alloc_huge_page(vma, address, outside_reserve);

if (IS_ERR(new_page)) {
@@ -2651,12 +2659,12 @@ retry_avoidcopy:
BUG_ON(huge_pte_none(pte));
if (unmap_ref_private(mm, vma, old_page, address)) {
BUG_ON(huge_pte_none(pte));
- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
ptep = huge_pte_offset(mm, address & huge_page_mask(h));
if (likely(pte_same(huge_ptep_get(ptep), pte)))
goto retry_avoidcopy;
/*
- * race occurs while re-acquiring page_table_lock, and
+ * race occurs while re-acquiring page table lock, and
* our job is done.
*/
return 0;
@@ -2665,7 +2673,7 @@ retry_avoidcopy:
}

/* Caller expects lock to be held */
- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
if (err == -ENOMEM)
return VM_FAULT_OOM;
else
@@ -2680,7 +2688,7 @@ retry_avoidcopy:
page_cache_release(new_page);
page_cache_release(old_page);
/* Caller expects lock to be held */
- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
return VM_FAULT_OOM;
}

@@ -2692,10 +2700,10 @@ retry_avoidcopy:
mmun_end = mmun_start + huge_page_size(h);
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
/*
- * Retake the page_table_lock to check for racing updates
+ * Retake the page table lock to check for racing updates
* before the page tables are altered
*/
- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
ptep = huge_pte_offset(mm, address & huge_page_mask(h));
if (likely(pte_same(huge_ptep_get(ptep), pte))) {
ClearPagePrivate(new_page);
@@ -2709,13 +2717,13 @@ retry_avoidcopy:
/* Make the old page be freed below */
new_page = old_page;
}
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
page_cache_release(new_page);
page_cache_release(old_page);

/* Caller expects lock to be held */
- spin_lock(&mm->page_table_lock);
+ spin_lock(ptl);
return 0;
}

@@ -2763,6 +2771,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page;
struct address_space *mapping;
pte_t new_pte;
+ spinlock_t *ptl;

/*
* Currently, we are forced to kill the process in the event the
@@ -2849,7 +2858,8 @@ retry:
goto backout_unlocked;
}

- spin_lock(&mm->page_table_lock);
+ ptl = huge_pte_lockptr(h, mm, ptep);
+ spin_lock(ptl);
size = i_size_read(mapping->host) >> huge_page_shift(h);
if (idx >= size)
goto backout;
@@ -2870,16 +2880,16 @@ retry:

if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
/* Optimization, do the COW without a second fault */
- ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
+ ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page, ptl);
}

- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
unlock_page(page);
out:
return ret;

backout:
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
backout_unlocked:
unlock_page(page);
put_page(page);
@@ -2891,6 +2901,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
{
pte_t *ptep;
pte_t entry;
+ spinlock_t *ptl;
int ret;
struct page *page = NULL;
struct page *pagecache_page = NULL;
@@ -2903,7 +2914,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (ptep) {
entry = huge_ptep_get(ptep);
if (unlikely(is_hugetlb_entry_migration(entry))) {
- migration_entry_wait_huge(mm, ptep);
+ migration_entry_wait_huge(h, mm, ptep);
return 0;
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
return VM_FAULT_HWPOISON_LARGE |
@@ -2959,17 +2970,18 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (page != pagecache_page)
lock_page(page);

- spin_lock(&mm->page_table_lock);
+ ptl = huge_pte_lockptr(h, mm, ptep);
+ spin_lock(ptl);
/* Check for a racing update before calling hugetlb_cow */
if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
- goto out_page_table_lock;
+ goto out_ptl;


if (flags & FAULT_FLAG_WRITE) {
if (!huge_pte_write(entry)) {
ret = hugetlb_cow(mm, vma, address, ptep, entry,
- pagecache_page);
- goto out_page_table_lock;
+ pagecache_page, ptl);
+ goto out_ptl;
}
entry = huge_pte_mkdirty(entry);
}
@@ -2978,8 +2990,8 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
flags & FAULT_FLAG_WRITE))
update_mmu_cache(vma, address, ptep);

-out_page_table_lock:
- spin_unlock(&mm->page_table_lock);
+out_ptl:
+ spin_unlock(ptl);

if (pagecache_page) {
unlock_page(pagecache_page);
@@ -3005,9 +3017,9 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long remainder = *nr_pages;
struct hstate *h = hstate_vma(vma);

- spin_lock(&mm->page_table_lock);
while (vaddr < vma->vm_end && remainder) {
pte_t *pte;
+ spinlock_t *ptl = NULL;
int absent;
struct page *page;

@@ -3015,8 +3027,12 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
* Some archs (sparc64, sh*) have multiple pte_ts to
* each hugepage. We have to make sure we get the
* first, for the page indexing below to work.
+ *
+ * Note that page table lock is not held when pte is null.
*/
pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
+ if (pte)
+ ptl = huge_pte_lock(h, mm, pte);
absent = !pte || huge_pte_none(huge_ptep_get(pte));

/*
@@ -3028,6 +3044,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
if (absent && (flags & FOLL_DUMP) &&
!hugetlbfs_pagecache_present(h, vma, vaddr)) {
+ if (pte)
+ spin_unlock(ptl);
remainder = 0;
break;
}
@@ -3047,10 +3065,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
!huge_pte_write(huge_ptep_get(pte)))) {
int ret;

- spin_unlock(&mm->page_table_lock);
+ if (pte)
+ spin_unlock(ptl);
ret = hugetlb_fault(mm, vma, vaddr,
(flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
- spin_lock(&mm->page_table_lock);
if (!(ret & VM_FAULT_ERROR))
continue;

@@ -3081,8 +3099,8 @@ same_page:
*/
goto same_page;
}
+ spin_unlock(ptl);
}
- spin_unlock(&mm->page_table_lock);
*nr_pages = remainder;
*position = vaddr;

@@ -3103,13 +3121,15 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
flush_cache_range(vma, address, end);

mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
- spin_lock(&mm->page_table_lock);
for (; address < end; address += huge_page_size(h)) {
+ spinlock_t *ptl;
ptep = huge_pte_offset(mm, address);
if (!ptep)
continue;
+ ptl = huge_pte_lock(h, mm, ptep);
if (huge_pmd_unshare(mm, &address, ptep)) {
pages++;
+ spin_unlock(ptl);
continue;
}
if (!huge_pte_none(huge_ptep_get(ptep))) {
@@ -3119,8 +3139,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
set_huge_pte_at(mm, address, ptep, pte);
pages++;
}
+ spin_unlock(ptl);
}
- spin_unlock(&mm->page_table_lock);
/*
* Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
* may have cleared our pud entry and done put_page on the page table:
@@ -3283,6 +3303,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
unsigned long saddr;
pte_t *spte = NULL;
pte_t *pte;
+ spinlock_t *ptl;

if (!vma_shareable(vma, addr))
return (pte_t *)pmd_alloc(mm, pud, addr);
@@ -3305,13 +3326,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
if (!spte)
goto out;

- spin_lock(&mm->page_table_lock);
+ ptl = huge_pte_lockptr(hstate_vma(vma), mm, spte);
+ spin_lock(ptl);
if (pud_none(*pud))
pud_populate(mm, pud,
(pmd_t *)((unsigned long)spte & PAGE_MASK));
else
put_page(virt_to_page(spte));
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(ptl);
out:
pte = (pte_t *)pmd_alloc(mm, pud, addr);
mutex_unlock(&mapping->i_mmap_mutex);
@@ -3325,7 +3347,7 @@ out:
* indicated by page_count > 1, unmap is achieved by clearing pud and
* decrementing the ref count. If count == 1, the pte page is not shared.
*
- * called with vma->vm_mm->page_table_lock held.
+ * called with page table lock held.
*
* returns: 1 successfully unmapped a shared pte page
* 0 the underlying pte page is not shared, or it is the last user
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 04729647f3..930a3e64bd 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -525,8 +525,9 @@ static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma,
#ifdef CONFIG_HUGETLB_PAGE
int nid;
struct page *page;
+ spinlock_t *ptl;

- spin_lock(&vma->vm_mm->page_table_lock);
+ ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pmd);
page = pte_page(huge_ptep_get((pte_t *)pmd));
nid = page_to_nid(page);
if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
@@ -536,7 +537,7 @@ static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma,
(flags & MPOL_MF_MOVE && page_mapcount(page) == 1))
isolate_huge_page(page, private);
unlock:
- spin_unlock(&vma->vm_mm->page_table_lock);
+ spin_unlock(ptl);
#else
BUG();
#endif
diff --git a/mm/migrate.c b/mm/migrate.c
index af71f53246..12bf3b8216 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -130,7 +130,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
ptep = huge_pte_offset(mm, addr);
if (!ptep)
goto out;
- ptl = &mm->page_table_lock;
+ ptl = huge_pte_lockptr(hstate_vma(vma), mm, ptep);
} else {
pmd = mm_find_pmd(mm, addr);
if (!pmd)
@@ -247,9 +247,10 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
__migration_entry_wait(mm, ptep, ptl);
}

-void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
+void migration_entry_wait_huge(struct hstate *h,
+ struct mm_struct *mm, pte_t *pte)
{
- spinlock_t *ptl = &(mm)->page_table_lock;
+ spinlock_t *ptl = huge_pte_lockptr(h, mm, pte);
__migration_entry_wait(mm, pte, ptl);
}

diff --git a/mm/rmap.c b/mm/rmap.c
index b59d741dcf..55c8b8dc9f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -601,7 +601,7 @@ pte_t *__page_check_address(struct page *page, struct mm_struct *mm,

if (unlikely(PageHuge(page))) {
pte = huge_pte_offset(mm, address);
- ptl = &mm->page_table_lock;
+ ptl = huge_pte_lockptr(page_hstate(page), mm, pte);
goto check;
}

--
Kirill A. Shutemov

2013-09-26 21:19:37

by Alex Thorlton

[permalink] [raw]
Subject: Re: [PATCHv2 0/9] split page table lock for PMD tables

> Let me guess: you have HUGETLBFS enabled in your config, right? ;)
>
> HUGETLBFS hasn't converted to new locking and we disable split pmd lock if
> HUGETLBFS is enabled.

Ahhhhh, that's got it! I double checked my config a million times to
make sure that I wasn't going crazy, but I must've missed that. With
that fixed, it's performing exactly how I thought it should. Looking
great to me!

Sorry for the confusion!

- Alex

2013-09-26 21:38:21

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2 0/9] split page table lock for PMD tables

Alex Thorlton wrote:
> > Let me guess: you have HUGETLBFS enabled in your config, right? ;)
> >
> > HUGETLBFS hasn't converted to new locking and we disable split pmd lock if
> > HUGETLBFS is enabled.
>
> Ahhhhh, that's got it! I double checked my config a million times to
> make sure that I wasn't going crazy, but I must've missed that. With
> that fixed, it's performing exactly how I thought it should. Looking
> great to me!

Can I use your Reviewed-by?

--
Kirill A. Shutemov

2013-09-26 21:43:18

by Alex Thorlton

[permalink] [raw]
Subject: Re: [PATCHv2 0/9] split page table lock for PMD tables

On Fri, Sep 27, 2013 at 12:38:07AM +0300, Kirill A. Shutemov wrote:
> Alex Thorlton wrote:
> > > Let me guess: you have HUGETLBFS enabled in your config, right? ;)
> > >
> > > HUGETLBFS hasn't converted to new locking and we disable split pmd lock if
> > > HUGETLBFS is enabled.
> >
> > Ahhhhh, that's got it! I double checked my config a million times to
> > make sure that I wasn't going crazy, but I must've missed that. With
> > that fixed, it's performing exactly how I thought it should. Looking
> > great to me!
>
> Can I use your Reviewed-by?

Sure, no problem.

- Alex

2013-09-26 21:44:56

by Alex Thorlton

[permalink] [raw]
Subject: Re: [PATCHv2 0/9] split page table lock for PMD tables

On Fri, Sep 27, 2013 at 12:42:46AM +0300, Kirill A. Shutemov wrote:
> Kirill A. Shutemov wrote:
> > Alex Thorlton wrote:
> > > > Let me guess: you have HUGETLBFS enabled in your config, right? ;)
> > > >
> > > > HUGETLBFS hasn't converted to new locking and we disable split pmd lock if
> > > > HUGETLBFS is enabled.
> > >
> > > Ahhhhh, that's got it! I double checked my config a million times to
> > > make sure that I wasn't going crazy, but I must've missed that. With
> > > that fixed, it's performing exactly how I thought it should. Looking
> > > great to me!
> >
> > Can I use your Reviewed-by?
>
> s/Reviewed-by/Tested-by/

That too. Whatever you need, haha.

- Alex

2013-09-27 00:06:00

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCHv2 0/9] split page table lock for PMD tables

On Thu, Sep 26, 2013 at 06:42:24PM +0300, Kirill A. Shutemov wrote:
> Kirill A. Shutemov wrote:
> > Alex Thorlton wrote:
> > > > THP off:
> > > > --------
> > ...
> > > > 36.540185552 seconds time elapsed ( +- 18.36% )
> > >
> > > I'm assuming this was THP off, no patchset, correct?
> >
> > Yes. But THP off patched is *very* close to this, so I didn't post it separately.
> >
> > > Here are my results from this test on 3.12-rc1:
> > ...
> > > 1138.759708820 seconds time elapsed ( +- 0.47% )
> > >
> > > And the same test on 3.12-rc1 with your patchset:
> > >
> > > Performance counter stats for './runt -t -c 512 -b 512m' (5 runs):
> > ...
> > > 1115.214191126 seconds time elapsed ( +- 0.18% )
> > >
> > > Looks like we're getting a mild performance increase here, but we still
> > > have a problem.
> >
> > Let me guess: you have HUGETLBFS enabled in your config, right? ;)
> >
> > HUGETLBFS hasn't converted to new locking and we disable split pmd lock if
> > HUGETLBFS is enabled.
> >
> > I'm going to convert HUGETLBFS too, but it might take some time.
>
> Okay, here is a bit reworked patch from Naoya Horiguchi.
> It might need more cleanup.
>
> Please, test and review.
>
> From 47e400fc308e0054c2cadf7df48f632555c83572 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <[email protected]>
> Date: Thu, 26 Sep 2013 17:51:33 +0300
> Subject: [PATCH] mm/hugetlb: convert hugetlbfs to use split pmd lock
>
> Hugetlb supports multiple page sizes. We use split lock only for PMD
> level, but not for PUD.

I like this simple approach, because I don't think the benefit of doing
split ptl for PUD is large enough compared with the cost of adding spinlock
initialization on every pud. We can revisit this when pud hugepages are
widely used.

> I've run workload from Alex Thorlton[1], slightly modified to use
> mmap(MAP_HUGETLB) for memory allocation.
>
> hugetlbfs, v3.12-rc2:
> ---------------------
>
> Performance counter stats for './thp_memscale_hugetlbfs -c 80 -b 512M' (5 runs):
>
> 2588052.787264 task-clock # 54.400 CPUs utilized ( +- 3.69% )
> 246,831 context-switches # 0.095 K/sec ( +- 4.15% )
> 138 cpu-migrations # 0.000 K/sec ( +- 5.30% )
> 21,027 page-faults # 0.008 K/sec ( +- 0.01% )
> 6,166,666,307,263 cycles # 2.383 GHz ( +- 3.68% ) [83.33%]
> 6,086,008,929,407 stalled-cycles-frontend # 98.69% frontend cycles idle ( +- 3.77% ) [83.33%]
> 5,087,874,435,481 stalled-cycles-backend # 82.51% backend cycles idle ( +- 4.41% ) [66.67%]
> 133,782,831,249 instructions # 0.02 insns per cycle
> # 45.49 stalled cycles per insn ( +- 4.30% ) [83.34%]
> 34,026,870,541 branches # 13.148 M/sec ( +- 4.24% ) [83.34%]
> 68,670,942 branch-misses # 0.20% of all branches ( +- 3.26% ) [83.33%]
>
> 47.574936948 seconds time elapsed ( +- 2.09% )
>
> hugetlbfs, patched:
> -------------------
>
> Performance counter stats for './thp_memscale_hugetlbfs -c 80 -b 512M' (5 runs):
>
> 395353.076837 task-clock # 20.329 CPUs utilized ( +- 8.16% )
> 55,730 context-switches # 0.141 K/sec ( +- 5.31% )
> 138 cpu-migrations # 0.000 K/sec ( +- 4.24% )
> 21,027 page-faults # 0.053 K/sec ( +- 0.00% )
> 930,219,717,244 cycles # 2.353 GHz ( +- 8.21% ) [83.32%]
> 914,295,694,103 stalled-cycles-frontend # 98.29% frontend cycles idle ( +- 8.35% ) [83.33%]
> 704,137,950,187 stalled-cycles-backend # 75.70% backend cycles idle ( +- 9.16% ) [66.69%]
> 30,541,538,385 instructions # 0.03 insns per cycle
> # 29.94 stalled cycles per insn ( +- 3.98% ) [83.35%]
> 8,415,376,631 branches # 21.286 M/sec ( +- 3.61% ) [83.36%]
> 32,645,478 branch-misses # 0.39% of all branches ( +- 3.41% ) [83.32%]
>
> 19.447481153 seconds time elapsed ( +- 2.00% )
>
> Split lock helps, but hugetlbfs is still significantly slower than THP
> (8.4 seconds).

The difference is interesting, but it may be a different problem from the
one we are chasing here. Even in the vanilla kernel, hugetlbfs looks slower.

> [1] ftp://shell.sgi.com/collect/memscale/thp_memscale.tar.gz
>
> Signed-off-by: Naoya Horiguchi <[email protected]>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> fs/proc/meminfo.c | 2 +-
> include/linux/hugetlb.h | 19 +++++++++
> include/linux/mm_types.h | 4 +-
> include/linux/swapops.h | 9 ++--
> mm/hugetlb.c | 108 ++++++++++++++++++++++++++++-------------------
> mm/mempolicy.c | 5 ++-
> mm/migrate.c | 7 +--
> mm/rmap.c | 2 +-
> 8 files changed, 100 insertions(+), 56 deletions(-)
>
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 59d85d6088..6d061f5359 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -1,8 +1,8 @@
> #include <linux/fs.h>
> -#include <linux/hugetlb.h>
> #include <linux/init.h>
> #include <linux/kernel.h>
> #include <linux/mm.h>
> +#include <linux/hugetlb.h>
> #include <linux/mman.h>
> #include <linux/mmzone.h>
> #include <linux/proc_fs.h>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 0393270466..4a4a73b1ec 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -401,6 +401,7 @@ struct hstate {};
> #define hstate_sizelog(s) NULL
> #define hstate_vma(v) NULL
> #define hstate_inode(i) NULL
> +#define page_hstate(page) NULL
> #define huge_page_size(h) PAGE_SIZE
> #define huge_page_mask(h) PAGE_MASK
> #define vma_kernel_pagesize(v) PAGE_SIZE
> @@ -423,4 +424,22 @@ static inline pgoff_t basepage_index(struct page *page)
> #define hugepage_migration_support(h) 0
> #endif /* CONFIG_HUGETLB_PAGE */
>
> +static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
> + struct mm_struct *mm, pte_t *pte)
> +{
> + if (huge_page_size(h) == PMD_SIZE)
> + return pmd_lockptr(mm, (pmd_t *) pte);
> + VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
> + return &mm->page_table_lock;
> +}
> +
> +static inline spinlock_t *huge_pte_lock(struct hstate *h,
> + struct mm_struct *mm, pte_t *pte)
> +{
> + spinlock_t *ptl;
> + ptl = huge_pte_lockptr(h, mm, pte);
> + spin_lock(ptl);
> + return ptl;
> +}
> +
> #endif /* _LINUX_HUGETLB_H */
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 8c2a3c3e28..90cc93b2cc 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -24,10 +24,8 @@
> struct address_space;
>
> #define USE_SPLIT_PTE_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
> -/* hugetlb hasn't converted to split locking yet */
> #define USE_SPLIT_PMD_PTLOCKS (USE_SPLIT_PTE_PTLOCKS && \
> - IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK) && \
> - !IS_ENABLED(CONFIG_HUGETLB_PAGE))
> + IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK))
>
> /*
> * Each physical page in the system has a struct page associated with
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 8d4fa82bfb..ad9f4b2964 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -102,6 +102,8 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
> return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
> }
>
> +struct hstate;
> +
> #ifdef CONFIG_MIGRATION
> static inline swp_entry_t make_migration_entry(struct page *page, int write)
> {

We can avoid this workaround by passing vma to migration_entry_wait_huge(),
and getting hstate inside that function.
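
Something like the sketch below (untested, just to illustrate the idea;
hugetlb_fault() already has the vma in hand at the call site):

void migration_entry_wait_huge(struct vm_area_struct *vma,
			       struct mm_struct *mm, pte_t *pte)
{
	spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), mm, pte);
	__migration_entry_wait(mm, pte, ptl);
}

That keeps struct hstate out of swapops.h entirely.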

Thanks,
Naoya Horiguchi