2012-08-29 15:33:56

by Gerald Schaefer

Subject: [RFC v2 PATCH 0/7] thp: transparent hugepages on s390

This patch series adds support for transparent hugepages on System z.
Small changes to common code are necessary with regard to a different
pgtable_t, tlb flushing and kvm behaviour on s390; see patches 1 to 3.

Changes in v2:
[PATCH 1/7] Fix build problem on non-s390 architectures.
[PATCH 2/7] Fix build problem on non-s390 architectures.
[PATCH 7/7] Make use of the rrbm instruction if available.

Gerald Schaefer (7):
thp: remove assumptions on pgtable_t type
thp: introduce pmdp_invalidate()
thp: make MADV_HUGEPAGE check for mm->def_flags
thp, s390: thp splitting backend for s390
thp, s390: thp pagetable pre-allocation for s390
thp, s390: disable thp for kvm host on s390
thp, s390: architecture backend for thp on s390

arch/s390/include/asm/hugetlb.h | 18 +---
arch/s390/include/asm/pgtable.h | 208 ++++++++++++++++++++++++++++++++++++++++
arch/s390/include/asm/setup.h | 3 +
arch/s390/include/asm/tlb.h | 1 +
arch/s390/kernel/early.c | 2 +
arch/s390/mm/gup.c | 11 ++-
arch/s390/mm/pgtable.c | 108 +++++++++++++++++++++
include/asm-generic/pgtable.h | 13 +++
include/linux/huge_mm.h | 1 -
mm/Kconfig | 2 +-
mm/huge_memory.c | 53 +++-------
mm/pgtable-generic.c | 50 ++++++++++
12 files changed, 408 insertions(+), 62 deletions(-)

--
1.7.11.5


2012-08-29 15:33:52

by Gerald Schaefer

Subject: [RFC v2 PATCH 4/7] thp, s390: thp splitting backend for s390

This patch is part of the architecture backend for thp on s390.
It provides the functions related to thp splitting, including
serialization against gup. Unlike on other architectures,
pmdp_splitting_flush() cannot use a tlb flushing operation to
serialize against gup-fast on s390, because such a flush is not
blocked by the disabled IRQs of a concurrent gup-fast walker. So
instead, smp_call_function() is called with an empty function, which
waits until every other CPU has taken the interrupt and has therefore
left its IRQs-disabled gup-fast section.
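
To illustrate the serialization argument (a minimal sketch, not part
of the patch; the helper names are made up):

        #include <linux/smp.h>

        static void splitting_sync(void *arg)
        {
                /* empty - taking the interrupt is the point */
        }

        static void wait_for_gup_fast(void)
        {
                /*
                 * wait == 1: smp_call_function() returns only after
                 * every other CPU has handled the IPI, which requires
                 * IRQs enabled, i.e. that CPU has left its gup-fast
                 * critical section.
                 */
                smp_call_function(splitting_sync, NULL, 1);
        }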

Signed-off-by: Gerald Schaefer <[email protected]>
---
arch/s390/include/asm/pgtable.h | 13 +++++++++++++
arch/s390/mm/gup.c | 11 ++++++++++-
arch/s390/mm/pgtable.c | 18 ++++++++++++++++++
3 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 6bd7d74..c3f0775 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -347,6 +347,8 @@ extern struct page *vmemmap;

#define _SEGMENT_ENTRY_LARGE 0x400 /* STE-format control, large page */
#define _SEGMENT_ENTRY_CO 0x100 /* change-recording override */
+#define _SEGMENT_ENTRY_SPLIT_BIT 0 /* THP splitting bit number */
+#define _SEGMENT_ENTRY_SPLIT (1UL << _SEGMENT_ENTRY_SPLIT_BIT)

/* Page status table bits for virtualization */
#define RCP_ACC_BITS 0xf000000000000000UL
@@ -506,6 +508,10 @@ static inline int pmd_bad(pmd_t pmd)
return (pmd_val(pmd) & mask) != _SEGMENT_ENTRY;
}

+#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
+extern void pmdp_splitting_flush(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmdp);
+
static inline int pte_none(pte_t pte)
{
return (pte_val(pte) & _PAGE_INVALID) && !(pte_val(pte) & _PAGE_SWT);
@@ -1159,6 +1165,13 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
#define pte_offset_map(pmd, address) pte_offset_kernel(pmd, address)
#define pte_unmap(pte) do { } while (0)

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline int pmd_trans_splitting(pmd_t pmd)
+{
+ return pmd_val(pmd) & _SEGMENT_ENTRY_SPLIT;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
/*
* 31 bit swap entry format:
* A page-table entry has some bits we have to treat in a special way.
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 65cb06e..d34fd6e 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -115,7 +115,16 @@ static inline int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr,
pmd = *pmdp;
barrier();
next = pmd_addr_end(addr, end);
- if (pmd_none(pmd))
+ /*
+ * The pmd_trans_splitting() check below explains why
+ * pmdp_splitting_flush() has to serialize with
+ * smp_call_function() against our disabled IRQs, to stop
+ * this gup-fast code from running while we set the
+ * splitting bit in the pmd. Returning zero will take
+ * the slow path that will call wait_split_huge_page()
+ * if the pmd is still in splitting state.
+ */
+ if (pmd_none(pmd) || pmd_trans_splitting(pmd))
return 0;
if (unlikely(pmd_huge(pmd))) {
if (!gup_huge_pmd(pmdp, pmd, addr, next,
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 18df31d..5a73522 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -866,3 +866,21 @@ bool kernel_page_present(struct page *page)
return cc == 0;
}
#endif /* CONFIG_HIBERNATION && CONFIG_DEBUG_PAGEALLOC */
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pmdp_splitting_flush_sync(void *arg)
+{
+ /* Simply deliver the interrupt */
+}
+
+void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmdp)
+{
+ VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+ if (!test_and_set_bit(_SEGMENT_ENTRY_SPLIT_BIT,
+ (unsigned long *) pmdp)) {
+ /* need to serialize against gup-fast (IRQ disabled) */
+ smp_call_function(pmdp_splitting_flush_sync, NULL, 1);
+ }
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
--
1.7.11.5

2012-08-29 15:34:00

by Gerald Schaefer

Subject: [RFC v2 PATCH 3/7] thp: make MADV_HUGEPAGE check for mm->def_flags

This adds a check to hugepage_madvise(), to refuse MADV_HUGEPAGE
if VM_NOHUGEPAGE is set in mm->def_flags. On s390, the VM_NOHUGEPAGE
flag will be set in mm->def_flags for kvm processes, to prevent any
future thp mappings. In order to also prevent MADV_HUGEPAGE on such an
mm, hugepage_madvise() should check mm->def_flags.
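
As an illustration of the user-visible effect, a hypothetical test
program (not part of the patch) run in a process whose mm->def_flags
contains VM_NOHUGEPAGE:

        #include <errno.h>
        #include <stdio.h>
        #include <sys/mman.h>

        int main(void)
        {
                size_t len = 2UL * 1024 * 1024;
                void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (p == MAP_FAILED)
                        return 1;
                /* fails with EINVAL e.g. in an s390 kvm host process
                 * after this series; succeeds elsewhere */
                if (madvise(p, len, MADV_HUGEPAGE) == -1)
                        printf("MADV_HUGEPAGE refused: errno=%d\n", errno);
                return 0;
        }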

Signed-off-by: Gerald Schaefer <[email protected]>
---
mm/huge_memory.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8bac516..f39cb03 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1464,6 +1464,8 @@ out:
int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
{
+ struct mm_struct *mm = vma->vm_mm;
+
switch (advice) {
case MADV_HUGEPAGE:
/*
@@ -1471,6 +1473,8 @@ int hugepage_madvise(struct vm_area_struct *vma,
*/
if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
return -EINVAL;
+ if (mm->def_flags & VM_NOHUGEPAGE)
+ return -EINVAL;
*vm_flags &= ~VM_NOHUGEPAGE;
*vm_flags |= VM_HUGEPAGE;
/*
--
1.7.11.5

2012-08-29 15:34:13

by Gerald Schaefer

Subject: [RFC v2 PATCH 1/7] thp: remove assumptions on pgtable_t type

The thp page table pre-allocation code currently assumes that pgtable_t
is of type "struct page *". This may not be true for all architectures,
so this patch removes that assumption by replacing the functions
prepare_pmd_huge_pte() and get_pmd_huge_pte() with two new functions,
pgtable_deposit() and pgtable_withdraw(), that architectures can
override with their own implementations.

It also removes two VM_BUG_ON checks for page_count() and page_mapcount()
operating on a pgtable_t. Apart from the VM_BUG_ON removal, this patch
introduces no functional change.
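
An architecture opts out of the generic versions with the usual
__HAVE_ARCH_* pattern, roughly as follows (a sketch; patch 5/7 does
exactly this for s390 in its asm/pgtable.h):

        /* defining the macros suppresses the generic functions
         * in mm/pgtable-generic.c */
        #define __HAVE_ARCH_PGTABLE_DEPOSIT
        extern void pgtable_deposit(struct mm_struct *mm, pgtable_t pgtable);

        #define __HAVE_ARCH_PGTABLE_WITHDRAW
        extern pgtable_t pgtable_withdraw(struct mm_struct *mm);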

Signed-off-by: Gerald Schaefer <[email protected]>
---
include/asm-generic/pgtable.h | 8 ++++++++
include/linux/huge_mm.h | 1 -
mm/huge_memory.c | 46 ++++++-------------------------------------
mm/pgtable-generic.c | 39 ++++++++++++++++++++++++++++++++++++
4 files changed, 53 insertions(+), 41 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index ff4947b..f756f60 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -162,6 +162,14 @@ extern void pmdp_splitting_flush(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp);
#endif

+#ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
+extern void pgtable_deposit(struct mm_struct *mm, pgtable_t pgtable);
+#endif
+
+#ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
+extern pgtable_t pgtable_withdraw(struct mm_struct *mm);
+#endif
+
#ifndef __HAVE_ARCH_PTE_SAME
static inline int pte_same(pte_t pte_a, pte_t pte_b)
{
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4c59b11..6ab47af 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -11,7 +11,6 @@ extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
pmd_t orig_pmd);
-extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
unsigned long addr,
pmd_t *pmd,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 57c4b93..6805328 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -611,19 +611,6 @@ out:
}
__setup("transparent_hugepage=", setup_transparent_hugepage);

-static void prepare_pmd_huge_pte(pgtable_t pgtable,
- struct mm_struct *mm)
-{
- assert_spin_locked(&mm->page_table_lock);
-
- /* FIFO */
- if (!mm->pmd_huge_pte)
- INIT_LIST_HEAD(&pgtable->lru);
- else
- list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
- mm->pmd_huge_pte = pgtable;
-}
-
static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
{
if (likely(vma->vm_flags & VM_WRITE))
@@ -665,7 +652,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
*/
page_add_new_anon_rmap(page, vma, haddr);
set_pmd_at(mm, haddr, pmd, entry);
- prepare_pmd_huge_pte(pgtable, mm);
+ pgtable_deposit(mm, pgtable);
add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
mm->nr_ptes++;
spin_unlock(&mm->page_table_lock);
@@ -791,7 +778,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmdp_set_wrprotect(src_mm, addr, src_pmd);
pmd = pmd_mkold(pmd_wrprotect(pmd));
set_pmd_at(dst_mm, addr, dst_pmd, pmd);
- prepare_pmd_huge_pte(pgtable, dst_mm);
+ pgtable_deposit(dst_mm, pgtable);
dst_mm->nr_ptes++;

ret = 0;
@@ -802,25 +789,6 @@ out:
return ret;
}

-/* no "address" argument so destroys page coloring of some arch */
-pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
-{
- pgtable_t pgtable;
-
- assert_spin_locked(&mm->page_table_lock);
-
- /* FIFO */
- pgtable = mm->pmd_huge_pte;
- if (list_empty(&pgtable->lru))
- mm->pmd_huge_pte = NULL;
- else {
- mm->pmd_huge_pte = list_entry(pgtable->lru.next,
- struct page, lru);
- list_del(&pgtable->lru);
- }
- return pgtable;
-}
-
static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address,
@@ -876,7 +844,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
pmdp_clear_flush_notify(vma, haddr, pmd);
/* leave pmd empty until pte is filled */

- pgtable = get_pmd_huge_pte(mm);
+ pgtable = pgtable_withdraw(mm);
pmd_populate(mm, &_pmd, pgtable);

for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
@@ -1041,7 +1009,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
if (__pmd_trans_huge_lock(pmd, vma) == 1) {
struct page *page;
pgtable_t pgtable;
- pgtable = get_pmd_huge_pte(tlb->mm);
+ pgtable = pgtable_withdraw(tlb->mm);
page = pmd_page(*pmd);
pmd_clear(pmd);
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
@@ -1358,7 +1326,7 @@ static int __split_huge_page_map(struct page *page,
pmd = page_check_address_pmd(page, mm, address,
PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
if (pmd) {
- pgtable = get_pmd_huge_pte(mm);
+ pgtable = pgtable_withdraw(mm);
pmd_populate(mm, &_pmd, pgtable);

for (i = 0, haddr = address; i < HPAGE_PMD_NR;
@@ -1971,8 +1939,6 @@ static void collapse_huge_page(struct mm_struct *mm,
pte_unmap(pte);
__SetPageUptodate(new_page);
pgtable = pmd_pgtable(_pmd);
- VM_BUG_ON(page_count(pgtable) != 1);
- VM_BUG_ON(page_mapcount(pgtable) != 0);

_pmd = mk_pmd(new_page, vma->vm_page_prot);
_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
@@ -1990,7 +1956,7 @@ static void collapse_huge_page(struct mm_struct *mm,
page_add_new_anon_rmap(new_page, vma, address);
set_pmd_at(mm, address, pmd, _pmd);
update_mmu_cache(vma, address, _pmd);
- prepare_pmd_huge_pte(pgtable, mm);
+ pgtable_deposit(mm, pgtable);
spin_unlock(&mm->page_table_lock);

#ifndef CONFIG_NUMA
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 74c0dda..308f1fb 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -120,3 +120,42 @@ void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif
+
+#ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void pgtable_deposit(struct mm_struct *mm, pgtable_t pgtable)
+{
+ assert_spin_locked(&mm->page_table_lock);
+
+ /* FIFO */
+ if (!mm->pmd_huge_pte)
+ INIT_LIST_HEAD(&pgtable->lru);
+ else
+ list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
+ mm->pmd_huge_pte = pgtable;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
+#ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/* no "address" argument so destroys page coloring of some arch */
+pgtable_t pgtable_withdraw(struct mm_struct *mm)
+{
+ pgtable_t pgtable;
+
+ assert_spin_locked(&mm->page_table_lock);
+
+ /* FIFO */
+ pgtable = mm->pmd_huge_pte;
+ if (list_empty(&pgtable->lru))
+ mm->pmd_huge_pte = NULL;
+ else {
+ mm->pmd_huge_pte = list_entry(pgtable->lru.next,
+ struct page, lru);
+ list_del(&pgtable->lru);
+ }
+ return pgtable;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
--
1.7.11.5

2012-08-29 15:33:57

by Gerald Schaefer

Subject: [RFC v2 PATCH 6/7] thp, s390: disable thp for kvm host on s390

This patch is part of the architecture backend for thp on s390.
It disables thp for kvm hosts, because there is no kvm host hugepage
support so far. Existing thp mappings are split by follow_page() with
FOLL_SPLIT, and future thp mappings are prevented by setting
VM_NOHUGEPAGE in mm->def_flags.

Signed-off-by: Gerald Schaefer <[email protected]>
---
arch/s390/mm/pgtable.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 3168fbd..7846072 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -787,6 +787,30 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
tlb_table_flush(tlb);
}

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void thp_split_vma(struct vm_area_struct *vma)
+{
+ unsigned long addr;
+ struct page *page;
+
+ for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
+ page = follow_page(vma, addr, FOLL_SPLIT);
+ }
+}
+
+void thp_split_mm(struct mm_struct *mm)
+{
+ struct vm_area_struct *vma = mm->mmap;
+
+ while (vma != NULL) {
+ thp_split_vma(vma);
+ vma->vm_flags &= ~VM_HUGEPAGE;
+ vma->vm_flags |= VM_NOHUGEPAGE;
+ vma = vma->vm_next;
+ }
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
/*
* switch on pgstes for its userspace process (for kvm)
*/
@@ -824,6 +848,12 @@ int s390_enable_sie(void)
if (!mm)
return -ENOMEM;

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ /* split thp mappings and disable thp for future mappings */
+ thp_split_mm(mm);
+ mm->def_flags |= VM_NOHUGEPAGE;
+#endif
+
/* Now lets check again if something happened */
task_lock(tsk);
if (!tsk->mm || atomic_read(&tsk->mm->mm_users) > 1 ||
--
1.7.11.5

2012-08-29 15:35:30

by Gerald Schaefer

Subject: [RFC v2 PATCH 5/7] thp, s390: thp pagetable pre-allocation for s390

This patch is part of the architecture backend for thp on s390.
It provides the pagetable pre-allocation functions pgtable_deposit()
and pgtable_withdraw(). Unlike on other architectures, pgtable_t on
s390 is not a struct page * but a pointer to the page table itself.
So instead of saving the pagetable pre-allocation list info inside
the struct page, it is saved within the page table itself.
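
Conceptually, while a page table sits on the pre-allocation list, its
first two 8-byte entries are reinterpreted as the list linkage (an
illustrative sketch only, not a struct from the patch;
pgtable_withdraw() below resets those two entries to empty PTEs before
handing the table back):

        #include <linux/list.h>

        /* layout of a deposited s390 page table (illustration) */
        struct deposited_pgtable {
                struct list_head lh;    /* overlays the first two PTEs */
                /* remaining PTEs of the page table follow */
        };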

Signed-off-by: Gerald Schaefer <[email protected]>
---
arch/s390/include/asm/pgtable.h | 6 ++++++
arch/s390/mm/pgtable.c | 38 ++++++++++++++++++++++++++++++++++++++
2 files changed, 44 insertions(+)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index c3f0775..353590c 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1166,6 +1166,12 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
#define pte_unmap(pte) do { } while (0)

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define __HAVE_ARCH_PGTABLE_DEPOSIT
+extern void pgtable_deposit(struct mm_struct *mm, pgtable_t pgtable);
+
+#define __HAVE_ARCH_PGTABLE_WITHDRAW
+extern pgtable_t pgtable_withdraw(struct mm_struct *mm);
+
static inline int pmd_trans_splitting(pmd_t pmd)
{
return pmd_val(pmd) & _SEGMENT_ENTRY_SPLIT;
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 5a73522..3168fbd 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -883,4 +883,42 @@ void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
smp_call_function(pmdp_splitting_flush_sync, NULL, 1);
}
}
+
+void pgtable_deposit(struct mm_struct *mm, pgtable_t pgtable)
+{
+ struct list_head *lh = (struct list_head *) pgtable;
+
+ assert_spin_locked(&mm->page_table_lock);
+
+ /* FIFO */
+ if (!mm->pmd_huge_pte)
+ INIT_LIST_HEAD(lh);
+ else
+ list_add(lh, (struct list_head *) mm->pmd_huge_pte);
+ mm->pmd_huge_pte = pgtable;
+}
+
+pgtable_t pgtable_withdraw(struct mm_struct *mm)
+{
+ struct list_head *lh;
+ pgtable_t pgtable;
+ pte_t *ptep;
+
+ assert_spin_locked(&mm->page_table_lock);
+
+ /* FIFO */
+ pgtable = mm->pmd_huge_pte;
+ lh = (struct list_head *) pgtable;
+ if (list_empty(lh))
+ mm->pmd_huge_pte = NULL;
+ else {
+ mm->pmd_huge_pte = (pgtable_t) lh->next;
+ list_del(lh);
+ }
+ ptep = (pte_t *) pgtable;
+ pte_val(*ptep) = _PAGE_TYPE_EMPTY;
+ ptep++;
+ pte_val(*ptep) = _PAGE_TYPE_EMPTY;
+ return pgtable;
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
--
1.7.11.5

2012-08-29 15:35:34

by Gerald Schaefer

Subject: [RFC v2 PATCH 7/7] thp, s390: architecture backend for thp on s390

This implements the architecture backend for transparent hugepages
on s390.

Signed-off-by: Gerald Schaefer <[email protected]>
---
arch/s390/include/asm/hugetlb.h | 18 +---
arch/s390/include/asm/pgtable.h | 189 ++++++++++++++++++++++++++++++++++++++++
arch/s390/include/asm/setup.h | 3 +
arch/s390/include/asm/tlb.h | 1 +
arch/s390/kernel/early.c | 2 +
arch/s390/mm/pgtable.c | 22 +++++
mm/Kconfig | 2 +-
7 files changed, 219 insertions(+), 18 deletions(-)

diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index 799ed0f..a840662 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -87,23 +87,6 @@ static inline void __pmd_csp(pmd_t *pmdp)
" csp %1,%3"
: "=m" (*pmdp)
: "d" (reg2), "d" (reg3), "d" (reg4), "m" (*pmdp) : "cc");
- pmd_val(*pmdp) = _SEGMENT_ENTRY_INV | _SEGMENT_ENTRY;
-}
-
-static inline void __pmd_idte(unsigned long address, pmd_t *pmdp)
-{
- unsigned long sto = (unsigned long) pmdp -
- pmd_index(address) * sizeof(pmd_t);
-
- if (!(pmd_val(*pmdp) & _SEGMENT_ENTRY_INV)) {
- asm volatile(
- " .insn rrf,0xb98e0000,%2,%3,0,0"
- : "=m" (*pmdp)
- : "m" (*pmdp), "a" (sto),
- "a" ((address & HPAGE_MASK))
- );
- }
- pmd_val(*pmdp) = _SEGMENT_ENTRY_INV | _SEGMENT_ENTRY;
}

static inline void huge_ptep_invalidate(struct mm_struct *mm,
@@ -115,6 +98,7 @@ static inline void huge_ptep_invalidate(struct mm_struct *mm,
__pmd_idte(address, pmdp);
else
__pmd_csp(pmdp);
+ pmd_val(*pmdp) = _SEGMENT_ENTRY_INV | _SEGMENT_ENTRY;
}

#define huge_ptep_set_access_flags(__vma, __addr, __ptep, __entry, __dirty) \
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 353590c..d51b5cb 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -350,6 +350,10 @@ extern struct page *vmemmap;
#define _SEGMENT_ENTRY_SPLIT_BIT 0 /* THP splitting bit number */
#define _SEGMENT_ENTRY_SPLIT (1UL << _SEGMENT_ENTRY_SPLIT_BIT)

+/* Set of bits not changed in pmd_modify */
+#define _SEGMENT_CHG_MASK (_SEGMENT_ENTRY_ORIGIN | _SEGMENT_ENTRY_LARGE \
+ | _SEGMENT_ENTRY_SPLIT | _SEGMENT_ENTRY_CO)
+
/* Page status table bits for virtualization */
#define RCP_ACC_BITS 0xf000000000000000UL
#define RCP_FP_BIT 0x0800000000000000UL
@@ -512,6 +516,26 @@ static inline int pmd_bad(pmd_t pmd)
extern void pmdp_splitting_flush(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmdp);

+#define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
+extern int pmdp_set_access_flags(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp,
+ pmd_t entry, int dirty);
+
+#define __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
+extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp);
+
+#define __HAVE_ARCH_PMD_WRITE
+static inline int pmd_write(pmd_t pmd)
+{
+ return (pmd_val(pmd) & _SEGMENT_ENTRY_RO) == 0;
+}
+
+static inline int pmd_young(pmd_t pmd)
+{
+ return 0;
+}
+
static inline int pte_none(pte_t pte)
{
return (pte_val(pte) & _PAGE_INVALID) && !(pte_val(pte) & _PAGE_SWT);
@@ -1165,6 +1189,22 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
#define pte_offset_map(pmd, address) pte_offset_kernel(pmd, address)
#define pte_unmap(pte) do { } while (0)

+static inline void __pmd_idte(unsigned long address, pmd_t *pmdp)
+{
+ unsigned long sto = (unsigned long) pmdp -
+ pmd_index(address) * sizeof(pmd_t);
+
+ if (!(pmd_val(*pmdp) & _SEGMENT_ENTRY_INV)) {
+ asm volatile(
+ " .insn rrf,0xb98e0000,%2,%3,0,0"
+ : "=m" (*pmdp)
+ : "m" (*pmdp), "a" (sto),
+ "a" ((address & HPAGE_MASK))
+ : "cc"
+ );
+ }
+}
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define __HAVE_ARCH_PGTABLE_DEPOSIT
extern void pgtable_deposit(struct mm_struct *mm, pgtable_t pgtable);
@@ -1176,6 +1216,155 @@ static inline int pmd_trans_splitting(pmd_t pmd)
{
return pmd_val(pmd) & _SEGMENT_ENTRY_SPLIT;
}
+
+static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
+ pmd_t *pmdp, pmd_t entry)
+{
+ *pmdp = entry;
+}
+
+static inline unsigned long massage_pgprot_pmd(pgprot_t pgprot)
+{
+ unsigned long pgprot_pmd = 0;
+
+ if (pgprot_val(pgprot) & _PAGE_INVALID) {
+ if (pgprot_val(pgprot) & _PAGE_SWT)
+ pgprot_pmd |= _HPAGE_TYPE_NONE;
+ pgprot_pmd |= _SEGMENT_ENTRY_INV;
+ }
+ if (pgprot_val(pgprot) & _PAGE_RO)
+ pgprot_pmd |= _SEGMENT_ENTRY_RO;
+ return pgprot_pmd;
+}
+
+static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
+{
+ pmd_val(pmd) &= _SEGMENT_CHG_MASK;
+ pmd_val(pmd) |= massage_pgprot_pmd(newprot);
+ return pmd;
+}
+
+static inline pmd_t pmd_mkhuge(pmd_t pmd)
+{
+ pmd_val(pmd) |= _SEGMENT_ENTRY_LARGE;
+ return pmd;
+}
+
+static inline pmd_t pmd_mkwrite(pmd_t pmd)
+{
+ pmd_val(pmd) &= ~_SEGMENT_ENTRY_RO;
+ return pmd;
+}
+
+static inline pmd_t pmd_wrprotect(pmd_t pmd)
+{
+ pmd_val(pmd) |= _SEGMENT_ENTRY_RO;
+ return pmd;
+}
+
+static inline pmd_t pmd_mkdirty(pmd_t pmd)
+{
+ /* No dirty bit in the segment table entry. */
+ return pmd;
+}
+
+static inline pmd_t pmd_mkold(pmd_t pmd)
+{
+ /* No referenced bit in the segment table entry. */
+ return pmd;
+}
+
+static inline pmd_t pmd_mkyoung(pmd_t pmd)
+{
+ /* No referenced bit in the segment table entry. */
+ return pmd;
+}
+
+#define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
+static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp)
+{
+ unsigned long pmd_addr = pmd_val(*pmdp) & HPAGE_MASK;
+ long tmp, rc;
+ int counter;
+
+ rc = 0;
+ if (MACHINE_HAS_RRBM) {
+ counter = PTRS_PER_PTE >> 6;
+ asm volatile(
+ "0: .insn rre,0xb9ae0000,%0,%3\n" /* rrbm */
+ " ogr %1,%0\n"
+ " la %3,0(%4,%3)\n"
+ " brct %2,0b\n"
+ : "=d" (tmp), "+d" (rc), "+d" (counter), "+a" (pmd_addr)
+ : "a" (64 * 4096UL) : "cc");
+ rc = !!rc;
+ } else {
+ counter = PTRS_PER_PTE;
+ asm volatile(
+ "0: rrbe 0,%2\n"
+ " la %2,0(%3,%2)\n"
+ " brc 12,1f\n"
+ " lhi %0,1\n"
+ "1: brct %1,0b\n"
+ : "+d" (rc), "+d" (counter), "+a" (pmd_addr)
+ : "a" (4096UL) : "cc");
+ }
+ return rc;
+}
+
+#define __HAVE_ARCH_PMDP_GET_AND_CLEAR
+static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm,
+ unsigned long address, pmd_t *pmdp)
+{
+ pmd_t pmd = *pmdp;
+
+ __pmd_idte(address, pmdp);
+ pmd_clear(pmdp);
+ return pmd;
+}
+
+#define __HAVE_ARCH_PMDP_CLEAR_FLUSH
+static inline pmd_t pmdp_clear_flush(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp)
+{
+ return pmdp_get_and_clear(vma->vm_mm, address, pmdp);
+}
+
+#define __HAVE_ARCH_PMDP_INVALIDATE
+static inline void pmdp_invalidate(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp)
+{
+ __pmd_idte(address, pmdp);
+}
+
+static inline pmd_t mk_pmd_phys(unsigned long physpage, pgprot_t pgprot)
+{
+ pmd_t __pmd;
+ pmd_val(__pmd) = physpage + massage_pgprot_pmd(pgprot);
+ return __pmd;
+}
+
+#define pfn_pmd(pfn, pgprot) mk_pmd_phys(__pa((pfn) << PAGE_SHIFT),(pgprot))
+#define mk_pmd(page, pgprot) pfn_pmd(page_to_pfn(page), (pgprot))
+
+static inline int pmd_trans_huge(pmd_t pmd)
+{
+ return pmd_val(pmd) & _SEGMENT_ENTRY_LARGE;
+}
+
+static inline int has_transparent_hugepage(void)
+{
+ return MACHINE_HAS_HPAGE ? 1 : 0;
+}
+
+static inline unsigned long pmd_pfn(pmd_t pmd)
+{
+ if (pmd_trans_huge(pmd))
+ return pmd_val(pmd) >> HPAGE_SHIFT;
+ else
+ return pmd_val(pmd) >> PAGE_SHIFT;
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

/*
diff --git a/arch/s390/include/asm/setup.h b/arch/s390/include/asm/setup.h
index e6859d1..13b545f 100644
--- a/arch/s390/include/asm/setup.h
+++ b/arch/s390/include/asm/setup.h
@@ -80,6 +80,7 @@ extern unsigned int addressing_mode;
#define MACHINE_FLAG_LPAR (1UL << 12)
#define MACHINE_FLAG_SPP (1UL << 13)
#define MACHINE_FLAG_TOPOLOGY (1UL << 14)
+#define MACHINE_FLAG_RRBM (1UL << 16)

#define MACHINE_IS_VM (S390_lowcore.machine_flags & MACHINE_FLAG_VM)
#define MACHINE_IS_KVM (S390_lowcore.machine_flags & MACHINE_FLAG_KVM)
@@ -98,6 +99,7 @@ extern unsigned int addressing_mode;
#define MACHINE_HAS_PFMF (0)
#define MACHINE_HAS_SPP (0)
#define MACHINE_HAS_TOPOLOGY (0)
+#define MACHINE_HAS_RRBM (0)
#else /* CONFIG_64BIT */
#define MACHINE_HAS_IEEE (1)
#define MACHINE_HAS_CSP (1)
@@ -109,6 +111,7 @@ extern unsigned int addressing_mode;
#define MACHINE_HAS_PFMF (S390_lowcore.machine_flags & MACHINE_FLAG_PFMF)
#define MACHINE_HAS_SPP (S390_lowcore.machine_flags & MACHINE_FLAG_SPP)
#define MACHINE_HAS_TOPOLOGY (S390_lowcore.machine_flags & MACHINE_FLAG_TOPOLOGY)
+#define MACHINE_HAS_RRBM (S390_lowcore.machine_flags & MACHINE_FLAG_RRBM)
#endif /* CONFIG_64BIT */

#define ZFCPDUMP_HSA_SIZE (32UL<<20)
diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 06e5acb..b75d7d6 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -137,6 +137,7 @@ static inline void pud_free_tlb(struct mmu_gather *tlb, pud_t *pud,
#define tlb_start_vma(tlb, vma) do { } while (0)
#define tlb_end_vma(tlb, vma) do { } while (0)
#define tlb_remove_tlb_entry(tlb, ptep, addr) do { } while (0)
+#define tlb_remove_pmd_tlb_entry(tlb, pmdp, addr) do { } while (0)
#define tlb_migrate_finish(mm) do { } while (0)

#endif /* _S390_TLB_H */
diff --git a/arch/s390/kernel/early.c b/arch/s390/kernel/early.c
index 83c3271..fb03cce 100644
--- a/arch/s390/kernel/early.c
+++ b/arch/s390/kernel/early.c
@@ -372,6 +372,8 @@ static __init void detect_machine_facilities(void)
S390_lowcore.machine_flags |= MACHINE_FLAG_MVCOS;
if (test_facility(40))
S390_lowcore.machine_flags |= MACHINE_FLAG_SPP;
+ if (test_facility(66))
+ S390_lowcore.machine_flags |= MACHINE_FLAG_RRBM;
#endif
}

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 7846072..216254b 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -898,6 +898,28 @@ bool kernel_page_present(struct page *page)
#endif /* CONFIG_HIBERNATION && CONFIG_DEBUG_PAGEALLOC */

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+int pmdp_clear_flush_young(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmdp)
+{
+ VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+ /* No need to flush TLB
+ * On s390 reference bits are in storage key and never in TLB */
+ return pmdp_test_and_clear_young(vma, address, pmdp);
+}
+
+int pmdp_set_access_flags(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp,
+ pmd_t entry, int dirty)
+{
+ VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+ if (pmd_same(*pmdp, entry))
+ return 0;
+ pmdp_invalidate(vma, address, pmdp);
+ set_pmd_at(vma->vm_mm, address, pmdp, entry);
+ return 1;
+}
+
static void pmdp_splitting_flush_sync(void *arg)
{
/* Simply deliver the interrupt */
diff --git a/mm/Kconfig b/mm/Kconfig
index d5c8019..7d51edb 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -318,7 +318,7 @@ config NOMMU_INITIAL_TRIM_EXCESS

config TRANSPARENT_HUGEPAGE
bool "Transparent Hugepage Support"
- depends on X86 && MMU
+ depends on (X86 || (S390 && 64BIT)) && MMU
select COMPACTION
help
Transparent Hugepages allows the kernel to use huge pages and
--
1.7.11.5

2012-08-29 15:35:31

by Gerald Schaefer

Subject: [RFC v2 PATCH 2/7] thp: introduce pmdp_invalidate()

On s390, a valid page table entry must not be changed while it is
attached to any CPU. So instead of pmd_mknotpresent() and set_pmd_at(),
an IDTE operation would be necessary there. This patch introduces the
pmdp_invalidate() function to allow for architecture-specific
implementations; the generic version preserves the existing
set_pmd_at() + flush_tlb_range() behaviour.
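
For comparison, the s390 override added in patch 7/7 of this series
reduces to a single IDTE operation:

        #define __HAVE_ARCH_PMDP_INVALIDATE
        static inline void pmdp_invalidate(struct vm_area_struct *vma,
                                           unsigned long address, pmd_t *pmdp)
        {
                __pmd_idte(address, pmdp);
        }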

Signed-off-by: Gerald Schaefer <[email protected]>
---
include/asm-generic/pgtable.h | 5 +++++
mm/huge_memory.c | 3 +--
mm/pgtable-generic.c | 11 +++++++++++
3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index f756f60..47519ef 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -170,6 +170,11 @@ extern void pgtable_deposit(struct mm_struct *mm, pgtable_t pgtable);
extern pgtable_t pgtable_withdraw(struct mm_struct *mm);
#endif

+#ifndef __HAVE_ARCH_PMDP_INVALIDATE
+extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmdp);
+#endif
+
#ifndef __HAVE_ARCH_PTE_SAME
static inline int pte_same(pte_t pte_a, pte_t pte_b)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6805328..8bac516 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1374,8 +1374,7 @@ static int __split_huge_page_map(struct page *page,
* SMP TLB and finally we write the non-huge version
* of the pmd entry with pmd_populate.
*/
- set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
- flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+ pmdp_invalidate(vma, address, pmd);
pmd_populate(mm, pmd, pgtable);
ret = 1;
}
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 308f1fb..ca01b4b 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -159,3 +159,14 @@ pgtable_t pgtable_withdraw(struct mm_struct *mm)
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif
+
+#ifndef __HAVE_ARCH_PMDP_INVALIDATE
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmdp)
+{
+ set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(*pmdp));
+ flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
--
1.7.11.5

2012-08-30 19:54:49

by Andrew Morton

Subject: Re: [RFC v2 PATCH 0/7] thp: transparent hugepages on s390

On Wed, 29 Aug 2012 17:32:57 +0200
Gerald Schaefer <[email protected]> wrote:

> This patch series adds support for transparent hugepages on System z.
> Small changes to common code are necessary with regard to a different
> pgtable_t, tlb flushing and kvm behaviour on s390; see patches 1 to 3.

"RFC" always worries me. I read it as "Really Flakey Code" ;) Is it
still appropriate to this patchset?

I grabbed them all. Patches 1-3 look sane to me and I cheerfully
didn't read the s390 changes at all. Hopefully Andrea will be able to
review at least patches 1-3 for us.

If that all goes well, how do we play this? I'd prefer to merge 1-3
myself, as they do interact with ongoing MM development. I can also
merge 4-7 if appropriate s390 maintainer acks are seen. Or I can drop
them and the s390 parts can be merged via the s390 tree at a later
date?

2012-08-31 05:52:39

by Aneesh Kumar K.V

Subject: Re: [RFC v2 PATCH 1/7] thp: remove assumptions on pgtable_t type

Gerald Schaefer <[email protected]> writes:

> The thp page table pre-allocation code currently assumes that pgtable_t
> is of type "struct page *". This may not be true for all architectures,
> so this patch removes that assumption by replacing the functions
> prepare_pmd_huge_pte() and get_pmd_huge_pte() with two new functions,
> pgtable_deposit() and pgtable_withdraw(), that architectures can
> override with their own implementations.
>
> It also removes two VM_BUG_ON checks for page_count() and page_mapcount()
> operating on a pgtable_t. Apart from the VM_BUG_ON removal, this patch
> introduces no functional change.

Why is that VM_BUG_ON not needed any more? What has changed that breaks
that requirement?

-aneesh

2012-08-31 07:08:15

by Martin Schwidefsky

Subject: Re: [RFC v2 PATCH 0/7] thp: transparent hugepages on s390

On Thu, 30 Aug 2012 12:54:44 -0700
Andrew Morton <[email protected]> wrote:

> On Wed, 29 Aug 2012 17:32:57 +0200
> Gerald Schaefer <[email protected]> wrote:
>
> > This patch series adds support for transparent hugepages on System z.
> > Small changes to common code are necessary with regard to a different
> > pgtable_t, tlb flushing and kvm behaviour on s390; see patches 1 to 3.
>
> "RFC" always worries me. I read it as "Really Flakey Code" ;) Is it
> still appropriate to this patchset?

The code quality is IMHO already good. We do change common mm code though
and we want to get some feedback on that.

> I grabbed them all. Patches 1-3 look sane to me and I cheerfully
> didn't read the s390 changes at all. Hopefully Andrea will be able to
> review at least patches 1-3 for us.
>
> If that all goes well, how do we play this? I'd prefer to merge 1-3
> myself, as they do interact with ongoing MM development. I can also
> merge 4-7 if appropriate s390 maintainer acks are seen. Or I can drop
> them and the s390 parts can be merged via the s390 tree at a later
> date?

I would really appreciate it if Andrea could have a look at the code. I've
read the patches and I am fine with them but it is very easy to miss some
important bit.

As far as upstreaming is concerned: I can deal with the pure s390 parts
via the s390 tree if that helps you. If you prefer to carry all of them,
that is fine with me as well.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2012-08-31 07:10:22

by Martin Schwidefsky

Subject: Re: [RFC v2 PATCH 1/7] thp: remove assumptions on pgtable_t type

On Fri, 31 Aug 2012 10:59:38 +0530
"Aneesh Kumar K.V" <[email protected]> wrote:

> Gerald Schaefer <[email protected]> writes:
>
> > The thp page table pre-allocation code currently assumes that pgtable_t
> > is of type "struct page *". This may not be true for all architectures,
> > so this patch removes that assumption by replacing the functions
> > prepare_pmd_huge_pte() and get_pmd_huge_pte() with two new functions,
> > pgtable_deposit() and pgtable_withdraw(), that architectures can
> > override with their own implementations.
> >
> > It also removes two VM_BUG_ON checks for page_count() and page_mapcount()
> > operating on a pgtable_t. Apart from the VM_BUG_ON removal, this patch
> > introduces no functional change.
>
> Why is that VM_BUG_ON not needed any more? What has changed that breaks
> that requirement?

Because pgtable_t for s390 is not a page and there simply is no page_count or
page_mapcount.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2012-08-31 19:47:06

by Andrew Morton

Subject: Re: [RFC v2 PATCH 0/7] thp: transparent hugepages on s390

On Fri, 31 Aug 2012 09:07:57 +0200
Martin Schwidefsky <[email protected]> wrote:

> > I grabbed them all. Patches 1-3 look sane to me and I cheerfully
> > didn't read the s390 changes at all. Hopefully Andrea will be able to
> > review at least patches 1-3 for us.
> >
> > If that all goes well, how do we play this? I'd prefer to merge 1-3
> > myself, as they do interact with ongoing MM development. I can also
> > merge 4-7 if appropriate s390 maintainer acks are seen. Or I can drop
> > them and the s390 parts can be merged via the s390 tree at a later
> > date?
>
> I would really appreciate it if Andrea could have a look at the code.

Yes please ;)

> I've
> read the patches and I am fine with them but it is very easy to miss some
> important bit.
>
> As far as upstreaming is concerned: I can deal with the pure s390 parts
> via the s390 tree if that helps you. If you prefer the carry all of them,
> that is fine with me as well.

It would be easiest/simplest if I were to merge it all. But that means
that the s390 developers wouldn't have tested it much, unless they're
testing linux-next?

I guess that's not the end of the world - you'll have a couple of
months after -rc1 to find and fix any problems.