This patch series makes khugepaged swap in pages, up to a
certain number, to improve THP collapse performance, and adds
tracepoints for khugepaged_scan_pmd, collapse_huge_page and
__collapse_huge_page_isolate.
This patch series was written to deal with programs
that access most, but not all, of their memory after
it has been swapped out. Currently these programs do
not get their memory collapsed into THPs after the
system has swapped it out, while they would get THPs
before the swapping happened.
This patch series was tested with a test program that
allocates 800MB of memory, writes to it, and then
sleeps. I force the system to swap out all of the
memory. Afterwards, the test program touches the area
by writing to it, but leaves a piece of it untouched.
This shows how much swapin readahead the patch performs.
Test results:
After swapped out
-------------------------------------------------------------------
| Anonymous | AnonHugePages | Swap | Fraction |
-------------------------------------------------------------------
With patch | 265772 kB | 264192 kB | 534232 kB | 99% |
-------------------------------------------------------------------
Without patch | 238160 kB | 235520 kB | 561844 kB | 98% |
-------------------------------------------------------------------
After swapped in
-------------------------------------------------------------------
| Anonymous | AnonHugePages | Swap | Fraction |
-------------------------------------------------------------------
With patch | 532756 kB | 528384 kB | 267248 kB | 99% |
-------------------------------------------------------------------
Without patch | 499956 kB | 235520 kB | 300048 kB | 47% |
-------------------------------------------------------------------
Ebru Akagunduz (3):
mm: add tracepoint for scanning pages
mm: make optimistic check for swapin readahead
mm: make swapin readahead to improve thp collapse rate
include/linux/mm.h | 4 ++
include/trace/events/huge_memory.h | 123 +++++++++++++++++++++++++++++++++++++
mm/huge_memory.c | 58 ++++++++++++++++-
mm/memory.c | 2 +-
4 files changed, 183 insertions(+), 4 deletions(-)
create mode 100644 include/trace/events/huge_memory.h
--
1.9.1
Using static tracepoints, function data can be recorded.
This helps automate debugging without requiring a lot
of changes in the source code.
This patch adds tracepoint for khugepaged_scan_pmd,
collapse_huge_page and __collapse_huge_page_isolate.
Signed-off-by: Ebru Akagunduz <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
Changes in v2:
- Nothing changed
include/trace/events/huge_memory.h | 96 ++++++++++++++++++++++++++++++++++++++
mm/huge_memory.c | 10 +++-
2 files changed, 105 insertions(+), 1 deletion(-)
create mode 100644 include/trace/events/huge_memory.h
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
new file mode 100644
index 0000000..4b9049b
--- /dev/null
+++ b/include/trace/events/huge_memory.h
@@ -0,0 +1,96 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM huge_memory
+
+#if !defined(__HUGE_MEMORY_H) || defined(TRACE_HEADER_MULTI_READ)
+#define __HUGE_MEMORY_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(mm_khugepaged_scan_pmd,
+
+ TP_PROTO(struct mm_struct *mm, unsigned long vm_start, bool writable,
+ bool referenced, int none_or_zero, int collapse),
+
+ TP_ARGS(mm, vm_start, writable, referenced, none_or_zero, collapse),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __field(unsigned long, vm_start)
+ __field(bool, writable)
+ __field(bool, referenced)
+ __field(int, none_or_zero)
+ __field(int, collapse)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->vm_start = vm_start;
+ __entry->writable = writable;
+ __entry->referenced = referenced;
+ __entry->none_or_zero = none_or_zero;
+ __entry->collapse = collapse;
+ ),
+
+ TP_printk("mm=%p, vm_start=%04lx, writable=%d, referenced=%d, none_or_zero=%d, collapse=%d",
+ __entry->mm,
+ __entry->vm_start,
+ __entry->writable,
+ __entry->referenced,
+ __entry->none_or_zero,
+ __entry->collapse)
+);
+
+TRACE_EVENT(mm_collapse_huge_page,
+
+ TP_PROTO(struct mm_struct *mm, unsigned long vm_start, int isolated),
+
+ TP_ARGS(mm, vm_start, isolated),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __field(unsigned long, vm_start)
+ __field(int, isolated)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->vm_start = vm_start;
+ __entry->isolated = isolated;
+ ),
+
+ TP_printk("mm=%p, vm_start=%04lx, isolated=%d",
+ __entry->mm,
+ __entry->vm_start,
+ __entry->isolated)
+);
+
+TRACE_EVENT(mm_collapse_huge_page_isolate,
+
+ TP_PROTO(unsigned long vm_start, int none_or_zero,
+ bool referenced, bool writable),
+
+ TP_ARGS(vm_start, none_or_zero, referenced, writable),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, vm_start)
+ __field(int, none_or_zero)
+ __field(bool, referenced)
+ __field(bool, writable)
+ ),
+
+ TP_fast_assign(
+ __entry->vm_start = vm_start;
+ __entry->none_or_zero = none_or_zero;
+ __entry->referenced = referenced;
+ __entry->writable = writable;
+ ),
+
+ TP_printk("vm_start=%04lx, none_or_zero=%d, referenced=%d, writable=%d",
+ __entry->vm_start,
+ __entry->none_or_zero,
+ __entry->referenced,
+ __entry->writable)
+);
+
+#endif /* __HUGE_MEMORY_H */
+#include <trace/define_trace.h>
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9671f51..9bb97fc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -29,6 +29,9 @@
#include <asm/pgalloc.h>
#include "internal.h"
+#define CREATE_TRACE_POINTS
+#include <trace/events/huge_memory.h>
+
/*
* By default transparent hugepage support is disabled in order that avoid
* to risk increase the memory footprint of applications without a guaranteed
@@ -2266,6 +2269,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
if (likely(referenced && writable))
return 1;
out:
+ trace_mm_collapse_huge_page_isolate(vma->vm_start, none_or_zero,
+ referenced, writable);
release_pte_pages(pte, _pte);
return 0;
}
@@ -2501,7 +2506,7 @@ static void collapse_huge_page(struct mm_struct *mm,
pgtable_t pgtable;
struct page *new_page;
spinlock_t *pmd_ptl, *pte_ptl;
- int isolated;
+ int isolated = 0;
unsigned long hstart, hend;
struct mem_cgroup *memcg;
unsigned long mmun_start; /* For mmu_notifiers */
@@ -2619,6 +2624,7 @@ static void collapse_huge_page(struct mm_struct *mm,
khugepaged_pages_collapsed++;
out_up_write:
up_write(&mm->mmap_sem);
+ trace_mm_collapse_huge_page(mm, vma->vm_start, isolated);
return;
out:
@@ -2694,6 +2700,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
ret = 1;
out_unmap:
pte_unmap_unlock(pte, ptl);
+ trace_mm_khugepaged_scan_pmd(mm, vma->vm_start, writable, referenced,
+ none_or_zero, ret);
if (ret) {
node = khugepaged_find_target_node();
/* collapse_huge_page will return with the mmap_sem released */
--
1.9.1
This patch makes an optimistic check for swapin readahead
to increase the THP collapse rate. Before bringing swapped-out
pages back into memory, it checks them and allows up to a
certain number of swap ptes. It also reports the number of
unmapped ptes via the tracepoint.
Signed-off-by: Ebru Akagunduz <[email protected]>
---
Changes in v2:
- Nothing changed
include/trace/events/huge_memory.h | 11 +++++++----
mm/huge_memory.c | 13 ++++++++++---
2 files changed, 17 insertions(+), 7 deletions(-)
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 4b9049b..53c9f2e 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -9,9 +9,9 @@
TRACE_EVENT(mm_khugepaged_scan_pmd,
TP_PROTO(struct mm_struct *mm, unsigned long vm_start, bool writable,
- bool referenced, int none_or_zero, int collapse),
+ bool referenced, int none_or_zero, int collapse, int unmapped),
- TP_ARGS(mm, vm_start, writable, referenced, none_or_zero, collapse),
+ TP_ARGS(mm, vm_start, writable, referenced, none_or_zero, collapse, unmapped),
TP_STRUCT__entry(
__field(struct mm_struct *, mm)
@@ -20,6 +20,7 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
__field(bool, referenced)
__field(int, none_or_zero)
__field(int, collapse)
+ __field(int, unmapped)
),
TP_fast_assign(
@@ -29,15 +30,17 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
__entry->referenced = referenced;
__entry->none_or_zero = none_or_zero;
__entry->collapse = collapse;
+ __entry->unmapped = unmapped;
),
- TP_printk("mm=%p, vm_start=%04lx, writable=%d, referenced=%d, none_or_zero=%d, collapse=%d",
+ TP_printk("mm=%p, vm_start=%04lx, writable=%d, referenced=%d, none_or_zero=%d, collapse=%d, unmapped=%d",
__entry->mm,
__entry->vm_start,
__entry->writable,
__entry->referenced,
__entry->none_or_zero,
- __entry->collapse)
+ __entry->collapse,
+ __entry->unmapped)
);
TRACE_EVENT(mm_collapse_huge_page,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9bb97fc..22bc0bf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -24,6 +24,7 @@
#include <linux/migrate.h>
#include <linux/hashtable.h>
#include <linux/userfaultfd_k.h>
+#include <linux/swapops.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
{
pmd_t *pmd;
pte_t *pte, *_pte;
- int ret = 0, none_or_zero = 0;
+ int ret = 0, none_or_zero = 0, unmapped = 0;
struct page *page;
unsigned long _address;
spinlock_t *ptl;
- int node = NUMA_NO_NODE;
+ int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8;
bool writable = false, referenced = false;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
@@ -2657,6 +2658,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
_pte++, _address += PAGE_SIZE) {
pte_t pteval = *_pte;
+ if (is_swap_pte(pteval)) {
+ if (++unmapped <= max_ptes_swap)
+ continue;
+ else
+ goto out_unmap;
+ }
if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
if (!userfaultfd_armed(vma) &&
++none_or_zero <= khugepaged_max_ptes_none)
@@ -2701,7 +2708,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
out_unmap:
pte_unmap_unlock(pte, ptl);
trace_mm_khugepaged_scan_pmd(mm, vma->vm_start, writable, referenced,
- none_or_zero, ret);
+ none_or_zero, ret, unmapped);
if (ret) {
node = khugepaged_find_target_node();
/* collapse_huge_page will return with the mmap_sem released */
--
1.9.1
This patch adds swapin readahead to improve the THP collapse rate.
When khugepaged scans pages, a few of them can be in the swap area.
With the patch, khugepaged can collapse 4kB pages into a THP when
up to max_ptes_swap of the ptes in a 2MB range are swap ptes.
The patch was tested with a test program that allocates
800MB of memory, writes to it, and then sleeps. I force
the system to swap out all of the memory. Afterwards, the
test program touches the area by writing, skipping one
page in every 20 pages of the area.
Without the patch, the system did not do swapin readahead;
the THP rate stayed at 47% of the program's memory and
did not change over time.
With this patch, after 10 minutes of waiting, khugepaged had
collapsed 99% of the program's memory.
Signed-off-by: Ebru Akagunduz <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
Changes in v2:
- Use FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT flag
instead of 0x0 when called do_swap_page
from __collapse_huge_page_swapin
Test results:
After swapped out
-------------------------------------------------------------------
| Anonymous | AnonHugePages | Swap | Fraction |
-------------------------------------------------------------------
With patch | 265772 kB | 264192 kB | 534232 kB | 99% |
-------------------------------------------------------------------
Without patch | 238160 kB | 235520 kB | 561844 kB | 98% |
-------------------------------------------------------------------
After swapped in
-------------------------------------------------------------------
| Anonymous | AnonHugePages | Swap | Fraction |
-------------------------------------------------------------------
With patch | 532756 kB | 528384 kB | 267248 kB | 99% |
-------------------------------------------------------------------
Without patch | 499956 kB | 235520 kB | 300048 kB | 47% |
-------------------------------------------------------------------
include/linux/mm.h | 4 ++++
include/trace/events/huge_memory.h | 24 ++++++++++++++++++++++++
mm/huge_memory.c | 37 +++++++++++++++++++++++++++++++++++++
mm/memory.c | 2 +-
4 files changed, 66 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7f47178..f66ff8a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -29,6 +29,10 @@ struct user_struct;
struct writeback_control;
struct bdi_writeback;
+extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pte_t *page_table, pmd_t *pmd,
+ unsigned int flags, pte_t orig_pte);
+
#ifndef CONFIG_NEED_MULTIPLE_NODES /* Don't use mapnrs, do it properly */
extern unsigned long max_mapnr;
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 53c9f2e..0117ab9 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -95,5 +95,29 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
__entry->writable)
);
+TRACE_EVENT(mm_collapse_huge_page_swapin,
+
+ TP_PROTO(struct mm_struct *mm, unsigned long vm_start, int swap_pte),
+
+ TP_ARGS(mm, vm_start, swap_pte),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __field(unsigned long, vm_start)
+ __field(int, swap_pte)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->vm_start = vm_start;
+ __entry->swap_pte = swap_pte;
+ ),
+
+ TP_printk("mm=%p, vm_start=%04lx, swap_pte=%d",
+ __entry->mm,
+ __entry->vm_start,
+ __entry->swap_pte)
+);
+
#endif /* __HUGE_MEMORY_H */
#include <trace/define_trace.h>
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 22bc0bf..064fd72 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2496,6 +2496,41 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
return true;
}
+/*
+ * Bring missing pages in from swap, to complete THP collapse.
+ * Only done if khugepaged_scan_pmd believes it is worthwhile.
+ *
+ * Called and returns without pte mapped or spinlocks held,
+ * but with mmap_sem held to protect against vma changes.
+ */
+
+static void __collapse_huge_page_swapin(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ pte_t *pte)
+{
+ unsigned long _address;
+ pte_t pteval = *pte;
+ int swap_pte = 0;
+
+ pte = pte_offset_map(pmd, address);
+ for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
+ pte++, _address += PAGE_SIZE) {
+ pteval = *pte;
+ if (is_swap_pte(pteval)) {
+ swap_pte++;
+ do_swap_page(mm, vma, _address, pte, pmd,
+ FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
+ pteval);
+ /* pte is unmapped now, we need to map it */
+ pte = pte_offset_map(pmd, _address);
+ }
+ }
+ pte--;
+ pte_unmap(pte);
+ trace_mm_collapse_huge_page_swapin(mm, vma->vm_start, swap_pte);
+}
+
static void collapse_huge_page(struct mm_struct *mm,
unsigned long address,
struct page **hpage,
@@ -2551,6 +2586,8 @@ static void collapse_huge_page(struct mm_struct *mm,
if (!pmd)
goto out;
+ __collapse_huge_page_swapin(mm, vma, address, pmd, pte);
+
anon_vma_lock_write(vma->anon_vma);
pte = pte_offset_map(pmd, address);
diff --git a/mm/memory.c b/mm/memory.c
index e1c45d0..d801dc5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2443,7 +2443,7 @@ EXPORT_SYMBOL(unmap_mapping_range);
* We return with the mmap_sem locked or unlocked in the same cases
* as does filemap_fault().
*/
-static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
+int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *page_table, pmd_t *pmd,
unsigned int flags, pte_t orig_pte)
{
--
1.9.1
On Sat, Jun 20, 2015 at 02:28:04PM +0300, Ebru Akagunduz wrote:
> Using static tracepoints, data of functions is recorded.
> It is good to automatize debugging without doing a lot
> of changes in the source code.
>
> This patch adds tracepoint for khugepaged_scan_pmd,
> collapse_huge_page and __collapse_huge_page_isolate.
>
> Signed-off-by: Ebru Akagunduz <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
On Sat, Jun 20, 2015 at 02:28:05PM +0300, Ebru Akagunduz wrote:
> This patch makes optimistic check for swapin readahead
> to increase thp collapse rate. Before getting swapped
> out pages to memory, checks them and allows up to a
> certain number. It also prints out using tracepoints
> amount of unmapped ptes.
>
> Signed-off-by: Ebru Akagunduz <[email protected]>
> ---
> Changes in v2:
> - Nothing changed
>
> include/trace/events/huge_memory.h | 11 +++++++----
> mm/huge_memory.c | 13 ++++++++++---
> 2 files changed, 17 insertions(+), 7 deletions(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 4b9049b..53c9f2e 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -9,9 +9,9 @@
> TRACE_EVENT(mm_khugepaged_scan_pmd,
>
> TP_PROTO(struct mm_struct *mm, unsigned long vm_start, bool writable,
> - bool referenced, int none_or_zero, int collapse),
> + bool referenced, int none_or_zero, int collapse, int unmapped),
>
> - TP_ARGS(mm, vm_start, writable, referenced, none_or_zero, collapse),
> + TP_ARGS(mm, vm_start, writable, referenced, none_or_zero, collapse, unmapped),
>
> TP_STRUCT__entry(
> __field(struct mm_struct *, mm)
> @@ -20,6 +20,7 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
> __field(bool, referenced)
> __field(int, none_or_zero)
> __field(int, collapse)
> + __field(int, unmapped)
> ),
>
> TP_fast_assign(
> @@ -29,15 +30,17 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
> __entry->referenced = referenced;
> __entry->none_or_zero = none_or_zero;
> __entry->collapse = collapse;
> + __entry->unmapped = unmapped;
> ),
>
> - TP_printk("mm=%p, vm_start=%04lx, writable=%d, referenced=%d, none_or_zero=%d, collapse=%d",
> + TP_printk("mm=%p, vm_start=%04lx, writable=%d, referenced=%d, none_or_zero=%d, collapse=%d, unmapped=%d",
> __entry->mm,
> __entry->vm_start,
> __entry->writable,
> __entry->referenced,
> __entry->none_or_zero,
> - __entry->collapse)
> + __entry->collapse,
> + __entry->unmapped)
> );
>
> TRACE_EVENT(mm_collapse_huge_page,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9bb97fc..22bc0bf 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -24,6 +24,7 @@
> #include <linux/migrate.h>
> #include <linux/hashtable.h>
> #include <linux/userfaultfd_k.h>
> +#include <linux/swapops.h>
>
> #include <asm/tlb.h>
> #include <asm/pgalloc.h>
> @@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> {
> pmd_t *pmd;
> pte_t *pte, *_pte;
> - int ret = 0, none_or_zero = 0;
> + int ret = 0, none_or_zero = 0, unmapped = 0;
> struct page *page;
> unsigned long _address;
> spinlock_t *ptl;
> - int node = NUMA_NO_NODE;
> + int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8;
I think this deserves a sysfs knob, like we have for
khugepaged_max_ptes_none.
> bool writable = false, referenced = false;
>
> VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> @@ -2657,6 +2658,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
> _pte++, _address += PAGE_SIZE) {
> pte_t pteval = *_pte;
> + if (is_swap_pte(pteval)) {
IIRC, is_swap_pte() is true for migration entries too.
Should we distinguish swap entries from migration entries here?
I guess not. On the other hand, we can expect migration entries
to be converted to normal ptes soon...
> + if (++unmapped <= max_ptes_swap)
> + continue;
> + else
> + goto out_unmap;
> + }
> if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> if (!userfaultfd_armed(vma) &&
> ++none_or_zero <= khugepaged_max_ptes_none)
> @@ -2701,7 +2708,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> out_unmap:
> pte_unmap_unlock(pte, ptl);
> trace_mm_khugepaged_scan_pmd(mm, vma->vm_start, writable, referenced,
> - none_or_zero, ret);
> + none_or_zero, ret, unmapped);
> if (ret) {
> node = khugepaged_find_target_node();
> /* collapse_huge_page will return with the mmap_sem released */
> --
> 1.9.1
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: [email protected]
--
Kirill A. Shutemov
On Sat, Jun 20, 2015 at 02:28:06PM +0300, Ebru Akagunduz wrote:
> This patch makes swapin readahead to improve thp collapse rate.
> When khugepaged scanned pages, there can be a few of the pages
> in swap area.
>
> With the patch THP can collapse 4kB pages into a THP when
> there are up to max_ptes_swap swap ptes in a 2MB range.
>
> The patch was tested with a test program that allocates
> 800MB of memory, writes to it, and then sleeps. I force
> the system to swap out all. Afterwards, the test program
> touches the area by writing, it skips a page in each
> 20 pages of the area.
>
> Without the patch, system did not swap in readahead.
> THP rate was %47 of the program of the memory, it
> did not change over time.
>
> With this patch, after 10 minutes of waiting khugepaged had
> collapsed %99 of the program's memory.
>
> Signed-off-by: Ebru Akagunduz <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> ---
> Changes in v2:
> - Use FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT flag
> instead of 0x0 when called do_swap_page
> from __collapse_huge_page_swapin
>
> Test results:
>
> After swapped out
> -------------------------------------------------------------------
> | Anonymous | AnonHugePages | Swap | Fraction |
> -------------------------------------------------------------------
> With patch | 265772 kB | 264192 kB | 534232 kB | %99 |
> -------------------------------------------------------------------
> Without patch | 238160 kB | 235520 kB | 561844 kB | %98 |
> -------------------------------------------------------------------
>
> After swapped in
> -------------------------------------------------------------------
> | Anonymous | AnonHugePages | Swap | Fraction |
> -------------------------------------------------------------------
> With patch | 532756 kB | 528384 kB | 267248 kB | %99 |
> -------------------------------------------------------------------
> Without patch | 499956 kB | 235520 kB | 300048 kB | %47 |
> -------------------------------------------------------------------
>
> include/linux/mm.h | 4 ++++
> include/trace/events/huge_memory.h | 24 ++++++++++++++++++++++++
> mm/huge_memory.c | 37 +++++++++++++++++++++++++++++++++++++
> mm/memory.c | 2 +-
> 4 files changed, 66 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7f47178..f66ff8a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -29,6 +29,10 @@ struct user_struct;
> struct writeback_control;
> struct bdi_writeback;
>
> +extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long address, pte_t *page_table, pmd_t *pmd,
> + unsigned int flags, pte_t orig_pte);
> +
> #ifndef CONFIG_NEED_MULTIPLE_NODES /* Don't use mapnrs, do it properly */
> extern unsigned long max_mapnr;
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 53c9f2e..0117ab9 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -95,5 +95,29 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
> __entry->writable)
> );
>
> +TRACE_EVENT(mm_collapse_huge_page_swapin,
> +
> + TP_PROTO(struct mm_struct *mm, unsigned long vm_start, int swap_pte),
> +
> + TP_ARGS(mm, vm_start, swap_pte),
> +
> + TP_STRUCT__entry(
> + __field(struct mm_struct *, mm)
> + __field(unsigned long, vm_start)
> + __field(int, swap_pte)
> + ),
> +
> + TP_fast_assign(
> + __entry->mm = mm;
> + __entry->vm_start = vm_start;
> + __entry->swap_pte = swap_pte;
> + ),
> +
> + TP_printk("mm=%p, vm_start=%04lx, swap_pte=%d",
> + __entry->mm,
> + __entry->vm_start,
> + __entry->swap_pte)
> +);
> +
> #endif /* __HUGE_MEMORY_H */
> #include <trace/define_trace.h>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 22bc0bf..064fd72 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2496,6 +2496,41 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
> return true;
> }
>
> +/*
> + * Bring missing pages in from swap, to complete THP collapse.
> + * Only done if khugepaged_scan_pmd believes it is worthwhile.
> + *
> + * Called and returns without pte mapped or spinlocks held,
> + * but with mmap_sem held to protect against vma changes.
> + */
> +
> +static void __collapse_huge_page_swapin(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + unsigned long address, pmd_t *pmd,
> + pte_t *pte)
> +{
> + unsigned long _address;
> + pte_t pteval = *pte;
> + int swap_pte = 0;
> +
> + pte = pte_offset_map(pmd, address);
> + for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
> + pte++, _address += PAGE_SIZE) {
> + pteval = *pte;
> + if (is_swap_pte(pteval)) {
> + swap_pte++;
> + do_swap_page(mm, vma, _address, pte, pmd,
> + FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
> + pteval);
Hm. I guess this is lacking error handling.
We really should abort early, at least for VM_FAULT_HWPOISON and VM_FAULT_OOM.
> + /* pte is unmapped now, we need to map it */
> + pte = pte_offset_map(pmd, _address);
No, it's within the same pte page table. It should be mapped with
pte_offset_map() above.
> + }
> + }
> + pte--;
> + pte_unmap(pte);
> + trace_mm_collapse_huge_page_swapin(mm, vma->vm_start, swap_pte);
> +}
> +
> static void collapse_huge_page(struct mm_struct *mm,
> unsigned long address,
> struct page **hpage,
> @@ -2551,6 +2586,8 @@ static void collapse_huge_page(struct mm_struct *mm,
> if (!pmd)
> goto out;
>
> + __collapse_huge_page_swapin(mm, vma, address, pmd, pte);
> +
And now the pages we swapped in are not isolated, right?
What prevents them from being swapped out again or whatever?
> anon_vma_lock_write(vma->anon_vma);
>
> pte = pte_offset_map(pmd, address);
> diff --git a/mm/memory.c b/mm/memory.c
> index e1c45d0..d801dc5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2443,7 +2443,7 @@ EXPORT_SYMBOL(unmap_mapping_range);
> * We return with the mmap_sem locked or unlocked in the same cases
> * as does filemap_fault().
> */
> -static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long address, pte_t *page_table, pmd_t *pmd,
> unsigned int flags, pte_t orig_pte)
> {
> --
> 1.9.1
>
--
Kirill A. Shutemov
On Sun, Jun 21, 2015 at 09:11:31PM +0300, Kirill A. Shutemov wrote:
> On Sat, Jun 20, 2015 at 02:28:06PM +0300, Ebru Akagunduz wrote:
> > + /* pte is unmapped now, we need to map it */
> > + pte = pte_offset_map(pmd, _address);
>
> No, it's within the same pte page table. It should be mapped with
> pte_offset_map() above.
Ahh.. do_swap_page() will unmap it. Probably worth rewording the comment.
--
Kirill A. Shutemov
On 06/21/2015 02:11 PM, Kirill A. Shutemov wrote:
> On Sat, Jun 20, 2015 at 02:28:06PM +0300, Ebru Akagunduz wrote:
>> +static void __collapse_huge_page_swapin(struct mm_struct *mm,
>> + struct vm_area_struct *vma,
>> + unsigned long address, pmd_t *pmd,
>> + pte_t *pte)
>> +{
>> + unsigned long _address;
>> + pte_t pteval = *pte;
>> + int swap_pte = 0;
>> +
>> + pte = pte_offset_map(pmd, address);
>> + for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
>> + pte++, _address += PAGE_SIZE) {
>> + pteval = *pte;
>> + if (is_swap_pte(pteval)) {
>> + swap_pte++;
>> + do_swap_page(mm, vma, _address, pte, pmd,
>> + FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
>> + pteval);
>
> Hm. I guess this lacking error handling.
> We really should abort early at least for VM_FAULT_HWPOISON and VM_FAULT_OOM.
Good catch.
>> + /* pte is unmapped now, we need to map it */
>> + pte = pte_offset_map(pmd, _address);
>
> No, it's within the same pte page table. It should be mapped with
> pte_offset_map() above.
It would be, except do_swap_page() unmaps the pte page table.
>> @@ -2551,6 +2586,8 @@ static void collapse_huge_page(struct mm_struct *mm,
>> if (!pmd)
>> goto out;
>>
>> + __collapse_huge_page_swapin(mm, vma, address, pmd, pte);
>> +
>
> And now the pages we swapped in are not isolated, right?
> What prevents them from being swapped out again or whatever?
Nothing, but __collapse_huge_page_isolate is run with the
appropriate locks to ensure that once we actually collapse
the THP, things are present.
The way do_swap_page is called, khugepaged does not even
wait for pages to be brought in from swap. It just maps
in pages that are in the swap cache, and which can be
immediately locked (without waiting).
It will also start IO on pages that are not in memory
yet, and will hopefully get those next round.
--
All rights reversed
On 06/20/2015 01:28 PM, Ebru Akagunduz wrote:
> Using static tracepoints, data of functions is recorded.
> It is good to automatize debugging without doing a lot
> of changes in the source code.
I agree and welcome the addition. But to get the most out of the
tracepoints, I'd like to suggest quite a lot of improvements below.
> This patch adds tracepoint for khugepaged_scan_pmd,
> collapse_huge_page and __collapse_huge_page_isolate.
>
> Signed-off-by: Ebru Akagunduz <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> ---
> Changes in v2:
> - Nothing changed
>
> include/trace/events/huge_memory.h | 96 ++++++++++++++++++++++++++++++++++++++
> mm/huge_memory.c | 10 +++-
> 2 files changed, 105 insertions(+), 1 deletion(-)
> create mode 100644 include/trace/events/huge_memory.h
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> new file mode 100644
> index 0000000..4b9049b
> --- /dev/null
> +++ b/include/trace/events/huge_memory.h
> @@ -0,0 +1,96 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM huge_memory
> +
> +#if !defined(__HUGE_MEMORY_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define __HUGE_MEMORY_H
> +
> +#include <linux/tracepoint.h>
> +
> +TRACE_EVENT(mm_khugepaged_scan_pmd,
> +
> + TP_PROTO(struct mm_struct *mm, unsigned long vm_start, bool writable,
> + bool referenced, int none_or_zero, int collapse),
> +
> + TP_ARGS(mm, vm_start, writable, referenced, none_or_zero, collapse),
> +
> + TP_STRUCT__entry(
> + __field(struct mm_struct *, mm)
> + __field(unsigned long, vm_start)
> + __field(bool, writable)
> + __field(bool, referenced)
> + __field(int, none_or_zero)
> + __field(int, collapse)
> + ),
> +
> + TP_fast_assign(
> + __entry->mm = mm;
> + __entry->vm_start = vm_start;
> + __entry->writable = writable;
> + __entry->referenced = referenced;
> + __entry->none_or_zero = none_or_zero;
> + __entry->collapse = collapse;
> + ),
> +
> + TP_printk("mm=%p, vm_start=%04lx, writable=%d, referenced=%d, none_or_zero=%d, collapse=%d",
> + __entry->mm,
> + __entry->vm_start,
> + __entry->writable,
> + __entry->referenced,
> + __entry->none_or_zero,
> + __entry->collapse)
> +);
> +
> +TRACE_EVENT(mm_collapse_huge_page,
> +
> + TP_PROTO(struct mm_struct *mm, unsigned long vm_start, int isolated),
Why vm_start and not the address of the page itself?
> +
> + TP_ARGS(mm, vm_start, isolated),
> +
> + TP_STRUCT__entry(
> + __field(struct mm_struct *, mm)
> + __field(unsigned long, vm_start)
> + __field(int, isolated)
> + ),
> +
> + TP_fast_assign(
> + __entry->mm = mm;
> + __entry->vm_start = vm_start;
> + __entry->isolated = isolated;
> + ),
> +
> + TP_printk("mm=%p, vm_start=%04lx, isolated=%d",
> + __entry->mm,
> + __entry->vm_start,
> + __entry->isolated)
> +);
> +
> +TRACE_EVENT(mm_collapse_huge_page_isolate,
> +
> + TP_PROTO(unsigned long vm_start, int none_or_zero,
> + bool referenced, bool writable),
> +
> + TP_ARGS(vm_start, none_or_zero, referenced, writable),
> +
> + TP_STRUCT__entry(
> + __field(unsigned long, vm_start)
> + __field(int, none_or_zero)
> + __field(bool, referenced)
> + __field(bool, writable)
> + ),
> +
> + TP_fast_assign(
> + __entry->vm_start = vm_start;
> + __entry->none_or_zero = none_or_zero;
> + __entry->referenced = referenced;
> + __entry->writable = writable;
> + ),
> +
> + TP_printk("vm_start=%04lx, none_or_zero=%d, referenced=%d, writable=%d",
> + __entry->vm_start,
> + __entry->none_or_zero,
> + __entry->referenced,
> + __entry->writable)
> +);
> +
> +#endif /* __HUGE_MEMORY_H */
> +#include <trace/define_trace.h>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9671f51..9bb97fc 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -29,6 +29,9 @@
> #include <asm/pgalloc.h>
> #include "internal.h"
>
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/huge_memory.h>
> +
> /*
> * By default transparent hugepage support is disabled in order that avoid
> * to risk increase the memory footprint of applications without a guaranteed
> @@ -2266,6 +2269,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> if (likely(referenced && writable))
> return 1;
No tracepoint here, so it only traces isolation failures. That's misleading,
because the event name doesn't suggest that. I suggest recording both outcomes
and distinguishing them by another event parameter, with a value matching
what's returned from the function.
> out:
> + trace_mm_collapse_huge_page_isolate(vma->vm_start, none_or_zero,
Again, why vm_start and not the value of address? Or maybe, the initial value
of address (= the beginning of the future hugepage) in case isolation succeeds,
or the current value of address when it fails on a particular pte due to one of
the "goto out"s.
> + referenced, writable);
OK so again there are numerous reasons why isolation can fail due to the "goto
out"s, which the tracepoint doesn't tell us. "referenced" and "writable" tell
us whether we've seen such a pte in the previous iterations, but otherwise they
may have nothing to do with the failure. We could distinguish "unexpected pte",
failure to lock, gup pin, reuse_swap_page() fail, isolate_lru_page() fail...
> release_pte_pages(pte, _pte);
> return 0;
> }
> @@ -2501,7 +2506,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> pgtable_t pgtable;
> struct page *new_page;
> spinlock_t *pmd_ptl, *pte_ptl;
> - int isolated;
> + int isolated = 0;
It's only used for 0/1, so why not convert it to bool, together with the return
value of __collapse_huge_page_isolate(), and adjust the tracepoints accordingly.
> unsigned long hstart, hend;
> struct mem_cgroup *memcg;
> unsigned long mmun_start; /* For mmu_notifiers */
> @@ -2619,6 +2624,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> khugepaged_pages_collapsed++;
> out_up_write:
> up_write(&mm->mmap_sem);
> + trace_mm_collapse_huge_page(mm, vma->vm_start, isolated);
The tracepoint as it is cannot distinguish many cases why collapse_huge_page()
fails:
- khugepaged_alloc_page() fails, or cgroup charge fails. In that case the
tracepoint isn't even called, which might be surprising.
- the various checks before isolation is attempted fail. Isolated will be
reported as 0 which might suggest it failed, but in fact it wasn't even attempted.
Distinguishing all the reasons to fail would probably be overkill, but it would
make sense to report separately: allocation fail, memcg charge fail, isolation
not attempted (= any of the checks after taking mmap_sem fail), isolation failed.
It would be a bit intrusive even when the tracepoint is disabled, but this is
not exactly a hot path. See e.g. how trace_mm_compaction_end reports the
various outcomes of the status as a string.
> return;
>
> out:
> @@ -2694,6 +2700,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> ret = 1;
> out_unmap:
> pte_unmap_unlock(pte, ptl);
> + trace_mm_khugepaged_scan_pmd(mm, vma->vm_start, writable, referenced,
> + none_or_zero, ret);
This is similar to trace_mm_collapse_huge_page_isolate and I think all my
suggestions apply here too. In fact the tracepoints could probably have the same
signature and you could use a single DECLARE_EVENT_CLASS for them both.
> if (ret) {
> node = khugepaged_find_target_node();
> /* collapse_huge_page will return with the mmap_sem released */
>
On 06/22/2015 03:37 AM, Rik van Riel wrote:
> On 06/21/2015 02:11 PM, Kirill A. Shutemov wrote:
>> On Sat, Jun 20, 2015 at 02:28:06PM +0300, Ebru Akagunduz wrote:
>>> + __collapse_huge_page_swapin(mm, vma, address, pmd, pte);
>>> +
>>
>> And now the pages we swapped in are not isolated, right?
>> What prevents them from being swapped out again or whatever?
>
> Nothing, but __collapse_huge_page_isolate is run with the
> appropriate locks to ensure that once we actually collapse
> the THP, things are present.
>
> The way do_swap_page is called, khugepaged does not even
> wait for pages to be brought in from swap. It just maps
> in pages that are in the swap cache, and which can be
> immediately locked (without waiting).
>
> It will also start IO on pages that are not in memory
> yet, and will hopefully get those next round.
Hm so what if the process is slightly larger than available memory and really
doesn't touch the swapped out pages that much? Won't that just be thrashing and
next round you find them swapped out again?
On 06/24/2015 08:33 AM, Vlastimil Babka wrote:
> On 06/22/2015 03:37 AM, Rik van Riel wrote:
>> On 06/21/2015 02:11 PM, Kirill A. Shutemov wrote:
>>> On Sat, Jun 20, 2015 at 02:28:06PM +0300, Ebru Akagunduz wrote:
>>>> + __collapse_huge_page_swapin(mm, vma, address, pmd, pte);
>>>> +
>>>
>>> And now the pages we swapped in are not isolated, right?
>>> What prevents them from being swapped out again or whatever?
>>
>> Nothing, but __collapse_huge_page_isolate is run with the
>> appropriate locks to ensure that once we actually collapse
>> the THP, things are present.
>>
>> The way do_swap_page is called, khugepaged does not even
>> wait for pages to be brought in from swap. It just maps
>> in pages that are in the swap cache, and which can be
>> immediately locked (without waiting).
>>
>> It will also start IO on pages that are not in memory
>> yet, and will hopefully get those next round.
>
> Hm so what if the process is slightly larger than available memory and really
> doesn't touch the swapped out pages that much? Won't that just be thrashing and
> next round you find them swapped out again?
Yes, it might.
However, all the policy smarts are in patch 2/3, not in
patch 3/3 (which has the mechanism).
I suspect the code could use some more smarts, but I am
not quite sure what they should be...
--
All rights reversed