2015-06-14 15:05:08

by Ebru Akagunduz

Subject: [RFC 0/3] mm: make swapin readahead to gain more thp performance

This patch series performs swapin readahead, up to a
certain number of pages, to gain more THP performance,
and adds tracepoints for khugepaged_scan_pmd,
collapse_huge_page and __collapse_huge_page_isolate.

This patch series was written to deal with programs
that access most, but not all, of their memory after
it has been swapped out. Currently these programs do
not get their memory collapsed into THPs after the
system swaps their memory out, whereas they would get
THPs before the swapping happened.

This patch series was tested with a test program that
allocates 800MB of memory, writes to it, and then
sleeps. I force the system to swap all of it out.
Afterwards, the test program touches the area by
writing again, leaving a piece of it untouched.
This shows how much swapin readahead the patch
performs.

I've written down test results:

With the patch:
After swapped out:
cat /proc/pid/smaps:
Anonymous: 470760 kB
AnonHugePages: 468992 kB
Swap: 329244 kB
Fraction: 99%

After swapped in:
In ten minutes:
cat /proc/pid/smaps:
Anonymous: 769208 kB
AnonHugePages: 765952 kB
Swap: 30796 kB
Fraction: 99%

Without the patch:
After swapped out:
cat /proc/pid/smaps:
Anonymous: 238160 kB
AnonHugePages: 235520 kB
Swap: 561844 kB
Fraction: 98%

After swapped in:
In ten minutes:
cat /proc/pid/smaps:
Anonymous: 499956 kB
AnonHugePages: 235520 kB
Swap: 300048 kB
Fraction: 47%

Ebru Akagunduz (3):
mm: add tracepoint for scanning pages
mm: make optimistic check for swapin readahead
mm: make swapin readahead to improve thp collapse rate

include/linux/mm.h | 4 ++
include/trace/events/huge_memory.h | 123 +++++++++++++++++++++++++++++++++++++
mm/huge_memory.c | 56 ++++++++++++++++-
mm/memory.c | 2 +-
4 files changed, 181 insertions(+), 4 deletions(-)
create mode 100644 include/trace/events/huge_memory.h

--
1.9.1


2015-06-14 15:05:26

by Ebru Akagunduz

Subject: [RFC 1/3] mm: add tracepoint for scanning pages

Using static tracepoints, data from these functions
can be recorded. This makes it possible to automate
debugging without a lot of changes to the source code.

This patch adds tracepoints for khugepaged_scan_pmd,
collapse_huge_page and __collapse_huge_page_isolate.

Signed-off-by: Ebru Akagunduz <[email protected]>
---
include/trace/events/huge_memory.h | 96 ++++++++++++++++++++++++++++++++++++++
mm/huge_memory.c | 10 +++-
2 files changed, 105 insertions(+), 1 deletion(-)
create mode 100644 include/trace/events/huge_memory.h

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
new file mode 100644
index 0000000..4b9049b
--- /dev/null
+++ b/include/trace/events/huge_memory.h
@@ -0,0 +1,96 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM huge_memory
+
+#if !defined(__HUGE_MEMORY_H) || defined(TRACE_HEADER_MULTI_READ)
+#define __HUGE_MEMORY_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(mm_khugepaged_scan_pmd,
+
+ TP_PROTO(struct mm_struct *mm, unsigned long vm_start, bool writable,
+ bool referenced, int none_or_zero, int collapse),
+
+ TP_ARGS(mm, vm_start, writable, referenced, none_or_zero, collapse),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __field(unsigned long, vm_start)
+ __field(bool, writable)
+ __field(bool, referenced)
+ __field(int, none_or_zero)
+ __field(int, collapse)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->vm_start = vm_start;
+ __entry->writable = writable;
+ __entry->referenced = referenced;
+ __entry->none_or_zero = none_or_zero;
+ __entry->collapse = collapse;
+ ),
+
+ TP_printk("mm=%p, vm_start=%04lx, writable=%d, referenced=%d, none_or_zero=%d, collapse=%d",
+ __entry->mm,
+ __entry->vm_start,
+ __entry->writable,
+ __entry->referenced,
+ __entry->none_or_zero,
+ __entry->collapse)
+);
+
+TRACE_EVENT(mm_collapse_huge_page,
+
+ TP_PROTO(struct mm_struct *mm, unsigned long vm_start, int isolated),
+
+ TP_ARGS(mm, vm_start, isolated),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __field(unsigned long, vm_start)
+ __field(int, isolated)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->vm_start = vm_start;
+ __entry->isolated = isolated;
+ ),
+
+ TP_printk("mm=%p, vm_start=%04lx, isolated=%d",
+ __entry->mm,
+ __entry->vm_start,
+ __entry->isolated)
+);
+
+TRACE_EVENT(mm_collapse_huge_page_isolate,
+
+ TP_PROTO(unsigned long vm_start, int none_or_zero,
+ bool referenced, bool writable),
+
+ TP_ARGS(vm_start, none_or_zero, referenced, writable),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, vm_start)
+ __field(int, none_or_zero)
+ __field(bool, referenced)
+ __field(bool, writable)
+ ),
+
+ TP_fast_assign(
+ __entry->vm_start = vm_start;
+ __entry->none_or_zero = none_or_zero;
+ __entry->referenced = referenced;
+ __entry->writable = writable;
+ ),
+
+ TP_printk("vm_start=%04lx, none_or_zero=%d, referenced=%d, writable=%d",
+ __entry->vm_start,
+ __entry->none_or_zero,
+ __entry->referenced,
+ __entry->writable)
+);
+
+#endif /* __HUGE_MEMORY_H */
+#include <trace/define_trace.h>
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9671f51..9bb97fc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -29,6 +29,9 @@
#include <asm/pgalloc.h>
#include "internal.h"

+#define CREATE_TRACE_POINTS
+#include <trace/events/huge_memory.h>
+
/*
* By default transparent hugepage support is disabled in order that avoid
* to risk increase the memory footprint of applications without a guaranteed
@@ -2266,6 +2269,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
if (likely(referenced && writable))
return 1;
out:
+ trace_mm_collapse_huge_page_isolate(vma->vm_start, none_or_zero,
+ referenced, writable);
release_pte_pages(pte, _pte);
return 0;
}
@@ -2501,7 +2506,7 @@ static void collapse_huge_page(struct mm_struct *mm,
pgtable_t pgtable;
struct page *new_page;
spinlock_t *pmd_ptl, *pte_ptl;
- int isolated;
+ int isolated = 0;
unsigned long hstart, hend;
struct mem_cgroup *memcg;
unsigned long mmun_start; /* For mmu_notifiers */
@@ -2619,6 +2624,7 @@ static void collapse_huge_page(struct mm_struct *mm,
khugepaged_pages_collapsed++;
out_up_write:
up_write(&mm->mmap_sem);
+ trace_mm_collapse_huge_page(mm, vma->vm_start, isolated);
return;

out:
@@ -2694,6 +2700,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
ret = 1;
out_unmap:
pte_unmap_unlock(pte, ptl);
+ trace_mm_khugepaged_scan_pmd(mm, vma->vm_start, writable, referenced,
+ none_or_zero, ret);
if (ret) {
node = khugepaged_find_target_node();
/* collapse_huge_page will return with the mmap_sem released */
--
1.9.1

2015-06-14 15:05:21

by Ebru Akagunduz

Subject: [RFC 2/3] mm: make optimistic check for swapin readahead

This patch makes an optimistic check for swapin readahead
to increase the THP collapse rate. Before bringing swapped-out
pages back into memory, it checks them and allows up to a
certain number of swap ptes. It also reports the number of
unmapped ptes via tracepoints.

Signed-off-by: Ebru Akagunduz <[email protected]>
---
include/trace/events/huge_memory.h | 11 +++++++----
mm/huge_memory.c | 13 ++++++++++---
2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 4b9049b..53c9f2e 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -9,9 +9,9 @@
TRACE_EVENT(mm_khugepaged_scan_pmd,

TP_PROTO(struct mm_struct *mm, unsigned long vm_start, bool writable,
- bool referenced, int none_or_zero, int collapse),
+ bool referenced, int none_or_zero, int collapse, int unmapped),

- TP_ARGS(mm, vm_start, writable, referenced, none_or_zero, collapse),
+ TP_ARGS(mm, vm_start, writable, referenced, none_or_zero, collapse, unmapped),

TP_STRUCT__entry(
__field(struct mm_struct *, mm)
@@ -20,6 +20,7 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
__field(bool, referenced)
__field(int, none_or_zero)
__field(int, collapse)
+ __field(int, unmapped)
),

TP_fast_assign(
@@ -29,15 +30,17 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
__entry->referenced = referenced;
__entry->none_or_zero = none_or_zero;
__entry->collapse = collapse;
+ __entry->unmapped = unmapped;
),

- TP_printk("mm=%p, vm_start=%04lx, writable=%d, referenced=%d, none_or_zero=%d, collapse=%d",
+ TP_printk("mm=%p, vm_start=%04lx, writable=%d, referenced=%d, none_or_zero=%d, collapse=%d, unmapped=%d",
__entry->mm,
__entry->vm_start,
__entry->writable,
__entry->referenced,
__entry->none_or_zero,
- __entry->collapse)
+ __entry->collapse,
+ __entry->unmapped)
);

TRACE_EVENT(mm_collapse_huge_page,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9bb97fc..22bc0bf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -24,6 +24,7 @@
#include <linux/migrate.h>
#include <linux/hashtable.h>
#include <linux/userfaultfd_k.h>
+#include <linux/swapops.h>

#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
{
pmd_t *pmd;
pte_t *pte, *_pte;
- int ret = 0, none_or_zero = 0;
+ int ret = 0, none_or_zero = 0, unmapped = 0;
struct page *page;
unsigned long _address;
spinlock_t *ptl;
- int node = NUMA_NO_NODE;
+ int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8;
bool writable = false, referenced = false;

VM_BUG_ON(address & ~HPAGE_PMD_MASK);
@@ -2657,6 +2658,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
_pte++, _address += PAGE_SIZE) {
pte_t pteval = *_pte;
+ if (is_swap_pte(pteval)) {
+ if (++unmapped <= max_ptes_swap)
+ continue;
+ else
+ goto out_unmap;
+ }
if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
if (!userfaultfd_armed(vma) &&
++none_or_zero <= khugepaged_max_ptes_none)
@@ -2701,7 +2708,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
out_unmap:
pte_unmap_unlock(pte, ptl);
trace_mm_khugepaged_scan_pmd(mm, vma->vm_start, writable, referenced,
- none_or_zero, ret);
+ none_or_zero, ret, unmapped);
if (ret) {
node = khugepaged_find_target_node();
/* collapse_huge_page will return with the mmap_sem released */
--
1.9.1

2015-06-14 15:05:35

by Ebru Akagunduz

Subject: [RFC 3/3] mm: make swapin readahead to improve thp collapse rate

This patch performs swapin readahead to improve the THP
collapse rate. When khugepaged scans pages, a few of them
may be in the swap area.

With the patch, khugepaged can collapse 4kB pages into a THP
when there are up to max_ptes_swap swap ptes in a 2MB range.

The patch was tested with a test program that allocates
800MB of memory, writes to it, and then sleeps. I force
the system to swap all of it out. Afterwards, the test
program touches the area by writing again, skipping one
page in every 20 pages of the area.

Without the patch, the system did no swapin readahead.
The THP rate was 47% of the program's memory and did
not change over time.

With this patch, after 10 minutes of waiting khugepaged had
collapsed 99% of the program's memory.

Signed-off-by: Ebru Akagunduz <[email protected]>
---
I've written down test results:

With the patch:
After swapped out:
cat /proc/pid/smaps:
Anonymous: 470760 kB
AnonHugePages: 468992 kB
Swap: 329244 kB
Fraction: 99%

After swapped in:
In ten minutes:
cat /proc/pid/smaps:
Anonymous: 769208 kB
AnonHugePages: 765952 kB
Swap: 30796 kB
Fraction: 99%

Without the patch:
After swapped out:
cat /proc/pid/smaps:
Anonymous: 238160 kB
AnonHugePages: 235520 kB
Swap: 561844 kB
Fraction: 98%

After swapped in:
In ten minutes:
cat /proc/pid/smaps:
Anonymous: 499956 kB
AnonHugePages: 235520 kB
Swap: 300048 kB
Fraction: 47%

include/linux/mm.h | 4 ++++
include/trace/events/huge_memory.h | 24 ++++++++++++++++++++++++
mm/huge_memory.c | 35 +++++++++++++++++++++++++++++++++++
mm/memory.c | 2 +-
4 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7f47178..f66ff8a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -29,6 +29,10 @@ struct user_struct;
struct writeback_control;
struct bdi_writeback;

+extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pte_t *page_table, pmd_t *pmd,
+ unsigned int flags, pte_t orig_pte);
+
#ifndef CONFIG_NEED_MULTIPLE_NODES /* Don't use mapnrs, do it properly */
extern unsigned long max_mapnr;

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 53c9f2e..0117ab9 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -95,5 +95,29 @@ TRACE_EVENT(mm_collapse_huge_page_isolate,
__entry->writable)
);

+TRACE_EVENT(mm_collapse_huge_page_swapin,
+
+ TP_PROTO(struct mm_struct *mm, unsigned long vm_start, int swap_pte),
+
+ TP_ARGS(mm, vm_start, swap_pte),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __field(unsigned long, vm_start)
+ __field(int, swap_pte)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __entry->vm_start = vm_start;
+ __entry->swap_pte = swap_pte;
+ ),
+
+ TP_printk("mm=%p, vm_start=%04lx, swap_pte=%d",
+ __entry->mm,
+ __entry->vm_start,
+ __entry->swap_pte)
+);
+
#endif /* __HUGE_MEMORY_H */
#include <trace/define_trace.h>
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 22bc0bf..cb3e82a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2496,6 +2496,39 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
return true;
}

+/*
+ * Bring missing pages in from swap, to complete THP collapse.
+ * Only done if khugepaged_scan_pmd believes it is worthwhile.
+ *
+ * Called and returns without pte mapped or spinlocks held,
+ * but with mmap_sem held to protect against vma changes.
+ */
+
+static void __collapse_huge_page_swapin(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ pte_t *pte)
+{
+ unsigned long _address;
+ pte_t pteval = *pte;
+ int swap_pte = 0;
+
+ pte = pte_offset_map(pmd, address);
+ for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
+ pte++, _address += PAGE_SIZE) {
+ pteval = *pte;
+ if (is_swap_pte(pteval)) {
+ swap_pte++;
+ do_swap_page(mm, vma, _address, pte, pmd, 0x0, pteval);
+ /* pte is unmapped now, we need to map it */
+ pte = pte_offset_map(pmd, _address);
+ }
+ }
+ pte--;
+ pte_unmap(pte);
+ trace_mm_collapse_huge_page_swapin(mm, vma->vm_start, swap_pte);
+}
+
static void collapse_huge_page(struct mm_struct *mm,
unsigned long address,
struct page **hpage,
@@ -2551,6 +2584,8 @@ static void collapse_huge_page(struct mm_struct *mm,
if (!pmd)
goto out;

+ __collapse_huge_page_swapin(mm, vma, address, pmd, pte);
+
anon_vma_lock_write(vma->anon_vma);

pte = pte_offset_map(pmd, address);
diff --git a/mm/memory.c b/mm/memory.c
index e1c45d0..d801dc5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2443,7 +2443,7 @@ EXPORT_SYMBOL(unmap_mapping_range);
* We return with the mmap_sem locked or unlocked in the same cases
* as does filemap_fault().
*/
-static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
+int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *page_table, pmd_t *pmd,
unsigned int flags, pte_t orig_pte)
{
--
1.9.1

2015-06-15 01:04:46

by Rik van Riel

Subject: Re: [RFC 1/3] mm: add tracepoint for scanning pages

On 06/14/2015 11:04 AM, Ebru Akagunduz wrote:
> Using static tracepoints, data from these functions
> can be recorded. This makes it possible to automate
> debugging without a lot of changes to the source code.
>
> This patch adds tracepoints for khugepaged_scan_pmd,
> collapse_huge_page and __collapse_huge_page_isolate.

These trace points seem like a useful set to figure out what
the THP collapse code is doing.

> Signed-off-by: Ebru Akagunduz <[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2015-06-15 05:40:34

by Leon Romanovsky

Subject: Re: [RFC 2/3] mm: make optimistic check for swapin readahead

On Sun, Jun 14, 2015 at 6:04 PM, Ebru Akagunduz
<[email protected]> wrote:
> This patch makes an optimistic check for swapin readahead
> to increase the THP collapse rate. Before bringing swapped-out
> pages back into memory, it checks them and allows up to a
> certain number of swap ptes. It also reports the number of
> unmapped ptes via tracepoints.
>
> Signed-off-by: Ebru Akagunduz <[email protected]>
> ---
> include/trace/events/huge_memory.h | 11 +++++++----
> mm/huge_memory.c | 13 ++++++++++---
> 2 files changed, 17 insertions(+), 7 deletions(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 4b9049b..53c9f2e 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -9,9 +9,9 @@
> TRACE_EVENT(mm_khugepaged_scan_pmd,
>
> TP_PROTO(struct mm_struct *mm, unsigned long vm_start, bool writable,
> - bool referenced, int none_or_zero, int collapse),
> + bool referenced, int none_or_zero, int collapse, int unmapped),
>
> - TP_ARGS(mm, vm_start, writable, referenced, none_or_zero, collapse),
> + TP_ARGS(mm, vm_start, writable, referenced, none_or_zero, collapse, unmapped),
>
> TP_STRUCT__entry(
> __field(struct mm_struct *, mm)
> @@ -20,6 +20,7 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
> __field(bool, referenced)
> __field(int, none_or_zero)
> __field(int, collapse)
> + __field(int, unmapped)
> ),
>
> TP_fast_assign(
> @@ -29,15 +30,17 @@ TRACE_EVENT(mm_khugepaged_scan_pmd,
> __entry->referenced = referenced;
> __entry->none_or_zero = none_or_zero;
> __entry->collapse = collapse;
> + __entry->unmapped = unmapped;
> ),
>
> - TP_printk("mm=%p, vm_start=%04lx, writable=%d, referenced=%d, none_or_zero=%d, collapse=%d",
> + TP_printk("mm=%p, vm_start=%04lx, writable=%d, referenced=%d, none_or_zero=%d, collapse=%d, unmapped=%d",
> __entry->mm,
> __entry->vm_start,
> __entry->writable,
> __entry->referenced,
> __entry->none_or_zero,
> - __entry->collapse)
> + __entry->collapse,
> + __entry->unmapped)
> );
>
> TRACE_EVENT(mm_collapse_huge_page,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9bb97fc..22bc0bf 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -24,6 +24,7 @@
> #include <linux/migrate.h>
> #include <linux/hashtable.h>
> #include <linux/userfaultfd_k.h>
> +#include <linux/swapops.h>
>
> #include <asm/tlb.h>
> #include <asm/pgalloc.h>
> @@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> {
> pmd_t *pmd;
> pte_t *pte, *_pte;
> - int ret = 0, none_or_zero = 0;
> + int ret = 0, none_or_zero = 0, unmapped = 0;
> struct page *page;
> unsigned long _address;
> spinlock_t *ptl;
> - int node = NUMA_NO_NODE;
> + int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8;
Sorry for asking, my knowledge of THP is very limited, but why did you
choose this default value?
From the discussion that followed your patch
(https://lkml.org/lkml/2015/2/27/432), I got the impression that it is
not necessarily the right value.

> bool writable = false, referenced = false;
>
> VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> @@ -2657,6 +2658,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
> _pte++, _address += PAGE_SIZE) {
> pte_t pteval = *_pte;
> + if (is_swap_pte(pteval)) {
> + if (++unmapped <= max_ptes_swap)
> + continue;
> + else
> + goto out_unmap;
> + }
> if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> if (!userfaultfd_armed(vma) &&
> ++none_or_zero <= khugepaged_max_ptes_none)
> @@ -2701,7 +2708,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> out_unmap:
> pte_unmap_unlock(pte, ptl);
> trace_mm_khugepaged_scan_pmd(mm, vma->vm_start, writable, referenced,
> - none_or_zero, ret);
> + none_or_zero, ret, unmapped);
> if (ret) {
> node = khugepaged_find_target_node();
> /* collapse_huge_page will return with the mmap_sem released */
> --
> 1.9.1
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: [email protected]



--
Leon Romanovsky | Independent Linux Consultant
http://www.leon.nu | [email protected]

2015-06-15 05:43:13

by Rik van Riel

Subject: Re: [RFC 2/3] mm: make optimistic check for swapin readahead

On 06/15/2015 01:40 AM, Leon Romanovsky wrote:
> On Sun, Jun 14, 2015 at 6:04 PM, Ebru Akagunduz
> <[email protected]> wrote:
>> This patch makes an optimistic check for swapin readahead
>> to increase the THP collapse rate. Before bringing swapped-out
>> pages back into memory, it checks them and allows up to a
>> certain number of swap ptes. It also reports the number of
>> unmapped ptes via tracepoints.
>>
>> Signed-off-by: Ebru Akagunduz <[email protected]>

>> @@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>> {
>> pmd_t *pmd;
>> pte_t *pte, *_pte;
>> - int ret = 0, none_or_zero = 0;
>> + int ret = 0, none_or_zero = 0, unmapped = 0;
>> struct page *page;
>> unsigned long _address;
>> spinlock_t *ptl;
>> - int node = NUMA_NO_NODE;
>> + int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8;
> Sorry for asking, my knowledge of THP is very limited, but why did you
> choose this default value?
> From the discussion that followed your patch
> (https://lkml.org/lkml/2015/2/27/432), I got the impression that it is
> not necessarily the right value.

I believe that Ebru's main focus for this initial version of
the patch series was to get the _mechanism_ (patch 3) right,
while having a fairly simple policy to drive it.

Any suggestions on when it is a good idea to bring in pages
from swap, and whether to treat resident-in-swap-cache pages
differently from need-to-be-paged-in pages, and what other
factors should be examined, are very welcome...

--
All rights reversed

2015-06-15 06:08:33

by Leon Romanovsky

Subject: Re: [RFC 2/3] mm: make optimistic check for swapin readahead

On Mon, Jun 15, 2015 at 8:43 AM, Rik van Riel <[email protected]> wrote:
> On 06/15/2015 01:40 AM, Leon Romanovsky wrote:
>> On Sun, Jun 14, 2015 at 6:04 PM, Ebru Akagunduz
>> <[email protected]> wrote:
>>> This patch makes an optimistic check for swapin readahead
>>> to increase the THP collapse rate. Before bringing swapped-out
>>> pages back into memory, it checks them and allows up to a
>>> certain number of swap ptes. It also reports the number of
>>> unmapped ptes via tracepoints.
>>>
>>> Signed-off-by: Ebru Akagunduz <[email protected]>
>
>>> @@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>> {
>>> pmd_t *pmd;
>>> pte_t *pte, *_pte;
>>> - int ret = 0, none_or_zero = 0;
>>> + int ret = 0, none_or_zero = 0, unmapped = 0;
>>> struct page *page;
>>> unsigned long _address;
>>> spinlock_t *ptl;
>>> - int node = NUMA_NO_NODE;
>>> + int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8;
>> Sorry for asking, my knowledge of THP is very limited, but why did you
>> choose this default value?
>> From the discussion that followed your patch
>> (https://lkml.org/lkml/2015/2/27/432), I got the impression that it is
>> not necessarily the right value.
>
> I believe that Ebru's main focus for this initial version of
> the patch series was to get the _mechanism_ (patch 3) right,
> while having a fairly simple policy to drive it.
>
> Any suggestions on when it is a good idea to bring in pages
> from swap, and whether to treat resident-in-swap-cache pages
> differently from need-to-be-paged-in pages, and what other
> factors should be examined, are very welcome...
My concern with these patches is that they deal with a specific
load/scenario (most of the application's memory coming back from
swap). In a scenario where only 10% of the data will be required,
they could theoretically bring in up to 80% of the data (70% waste).

>
> --
> All rights reversed



--
Leon Romanovsky | Independent Linux Consultant
http://www.leon.nu | [email protected]

2015-06-15 06:35:51

by Rik van Riel

Subject: Re: [RFC 2/3] mm: make optimistic check for swapin readahead

On 06/15/2015 02:08 AM, Leon Romanovsky wrote:
> On Mon, Jun 15, 2015 at 8:43 AM, Rik van Riel <[email protected]> wrote:
>> On 06/15/2015 01:40 AM, Leon Romanovsky wrote:
>>> On Sun, Jun 14, 2015 at 6:04 PM, Ebru Akagunduz
>>> <[email protected]> wrote:
>>>> This patch makes an optimistic check for swapin readahead
>>>> to increase the THP collapse rate. Before bringing swapped-out
>>>> pages back into memory, it checks them and allows up to a
>>>> certain number of swap ptes. It also reports the number of
>>>> unmapped ptes via tracepoints.
>>>>
>>>> Signed-off-by: Ebru Akagunduz <[email protected]>
>>
>>>> @@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>>>> {
>>>> pmd_t *pmd;
>>>> pte_t *pte, *_pte;
>>>> - int ret = 0, none_or_zero = 0;
>>>> + int ret = 0, none_or_zero = 0, unmapped = 0;
>>>> struct page *page;
>>>> unsigned long _address;
>>>> spinlock_t *ptl;
>>>> - int node = NUMA_NO_NODE;
>>>> + int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8;
>>> Sorry for asking, my knowledge of THP is very limited, but why did you
>>> choose this default value?
>>> From the discussion that followed your patch
>>> (https://lkml.org/lkml/2015/2/27/432), I got the impression that it is
>>> not necessarily the right value.
>>
>> I believe that Ebru's main focus for this initial version of
>> the patch series was to get the _mechanism_ (patch 3) right,
>> while having a fairly simple policy to drive it.
>>
>> Any suggestions on when it is a good idea to bring in pages
>> from swap, and whether to treat resident-in-swap-cache pages
>> differently from need-to-be-paged-in pages, and what other
>> factors should be examined, are very welcome...
> My concern with these patches that they deal with specific
> load/scenario (most of the application returned back from swap). In
> scenario there only 10% of data will be required, it theoretically can
> bring upto 80% data (70% waste).

The chosen threshold ensures that the remaining non-resident
4kB pages in a THP are only brought in if 7/8th (or 87.5%) of
the pages are already resident.

--
All rights reversed

2015-06-15 14:00:08

by Rik van Riel

Subject: Re: [RFC 3/3] mm: make swapin readahead to improve thp collapse rate

On 06/14/2015 11:04 AM, Ebru Akagunduz wrote:
> This patch performs swapin readahead to improve the THP
> collapse rate. When khugepaged scans pages, a few of them
> may be in the swap area.
>
> With the patch, khugepaged can collapse 4kB pages into a THP
> when there are up to max_ptes_swap swap ptes in a 2MB range.
>
> The patch was tested with a test program that allocates
> 800MB of memory, writes to it, and then sleeps. I force
> the system to swap all of it out. Afterwards, the test
> program touches the area by writing again, skipping one
> page in every 20 pages of the area.
>
> Without the patch, the system did no swapin readahead.
> The THP rate was 47% of the program's memory and did
> not change over time.
>
> With this patch, after 10 minutes of waiting khugepaged had
> collapsed 99% of the program's memory.
>
> Signed-off-by: Ebru Akagunduz <[email protected]>

Mechanism looks good to me.

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2015-06-15 14:05:39

by Rik van Riel

Subject: Re: [RFC 2/3] mm: make optimistic check for swapin readahead

On 06/14/2015 11:04 AM, Ebru Akagunduz wrote:
> This patch makes an optimistic check for swapin readahead
> to increase the THP collapse rate. Before bringing swapped-out
> pages back into memory, it checks them and allows up to a
> certain number of swap ptes. It also reports the number of
> unmapped ptes via tracepoints.
>
> Signed-off-by: Ebru Akagunduz <[email protected]>

> @@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> {
> pmd_t *pmd;
> pte_t *pte, *_pte;
> - int ret = 0, none_or_zero = 0;
> + int ret = 0, none_or_zero = 0, unmapped = 0;
> struct page *page;
> unsigned long _address;
> spinlock_t *ptl;
> - int node = NUMA_NO_NODE;
> + int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8;
> bool writable = false, referenced = false;

This has the effect of only swapping in 4kB pages to form a THP
if 7/8th of the THP is already resident in memory.

This is a pretty conservative thing to do.

I am not sure if we would also need to take into account things
like these:
1) How many pages in the THP-area are recently referenced?
Maybe this does not matter if 87.5% of the 4kB pages got
faulted in after swap-out, anyway?
2) How much free memory does the system have?
We don't test that for collapsing a THP with lots of
pte_none() ptes, so not sure how much this matters...
3) How many of the pages we want to swap in are already resident
in the swap cache?
Not sure exactly what to do with this number...
4) other factors?

I am also not sure how we would determine such a policy, except
by maybe having these patches sit in -mm and -next for a few
cycles, and seeing what happens...


--
All rights reversed

2015-06-15 16:08:09

by Leon Romanovsky

Subject: Re: [RFC 2/3] mm: make optimistic check for swapin readahead

On Mon, Jun 15, 2015 at 5:05 PM, Rik van Riel <[email protected]> wrote:
>
> On 06/14/2015 11:04 AM, Ebru Akagunduz wrote:
> > This patch makes an optimistic check for swapin readahead
> > to increase the THP collapse rate. Before bringing swapped-out
> > pages back into memory, it checks them and allows up to a
> > certain number of swap ptes. It also reports the number of
> > unmapped ptes via tracepoints.
> >
> > Signed-off-by: Ebru Akagunduz <[email protected]>
>
> > @@ -2639,11 +2640,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> > {
> > pmd_t *pmd;
> > pte_t *pte, *_pte;
> > - int ret = 0, none_or_zero = 0;
> > + int ret = 0, none_or_zero = 0, unmapped = 0;
> > struct page *page;
> > unsigned long _address;
> > spinlock_t *ptl;
> > - int node = NUMA_NO_NODE;
> > + int node = NUMA_NO_NODE, max_ptes_swap = HPAGE_PMD_NR/8;
> > bool writable = false, referenced = false;
>
> This has the effect of only swapping in 4kB pages to form a THP
> if 7/8th of the THP is already resident in memory.
Thanks for clarifying it for me.

2015-06-16 21:15:43

by Andrew Morton

Subject: Re: [RFC 3/3] mm: make swapin readahead to improve thp collapse rate

On Sun, 14 Jun 2015 18:04:43 +0300 Ebru Akagunduz <[email protected]> wrote:

> This patch performs swapin readahead to improve the THP
> collapse rate. When khugepaged scans pages, a few of them
> may be in the swap area.
>
> With the patch, khugepaged can collapse 4kB pages into a THP
> when there are up to max_ptes_swap swap ptes in a 2MB range.
>
> The patch was tested with a test program that allocates
> 800MB of memory, writes to it, and then sleeps. I force
> the system to swap all of it out. Afterwards, the test
> program touches the area by writing again, skipping one
> page in every 20 pages of the area.
>
> Without the patch, the system did no swapin readahead.
> The THP rate was 47% of the program's memory and did
> not change over time.
>
> With this patch, after 10 minutes of waiting khugepaged had
> collapsed 99% of the program's memory.
>
> ...
>
> +/*
> + * Bring missing pages in from swap, to complete THP collapse.
> + * Only done if khugepaged_scan_pmd believes it is worthwhile.
> + *
> + * Called and returns without pte mapped or spinlocks held,
> + * but with mmap_sem held to protect against vma changes.
> + */
> +
> +static void __collapse_huge_page_swapin(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> + unsigned long address, pmd_t *pmd,
> + pte_t *pte)
> +{
> + unsigned long _address;
> + pte_t pteval = *pte;
> + int swap_pte = 0;
> +
> + pte = pte_offset_map(pmd, address);
> + for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
> + pte++, _address += PAGE_SIZE) {
> + pteval = *pte;
> + if (is_swap_pte(pteval)) {
> + swap_pte++;
> + do_swap_page(mm, vma, _address, pte, pmd, 0x0, pteval);
> + /* pte is unmapped now, we need to map it */
> + pte = pte_offset_map(pmd, _address);
> + }
> + }
> + pte--;
> + pte_unmap(pte);
> + trace_mm_collapse_huge_page_swapin(mm, vma->vm_start, swap_pte);
> +}

This is doing a series of synchronous reads. That will be sloooow on
spinning disks.

This function should be significantly faster if it first gets all the
necessary I/O underway. I don't think we have a function which exactly
does this. Perhaps generalise swapin_readahead() or open-code
something like

	blk_start_plug(...);
	for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
	     pte++, _address += PAGE_SIZE) {
		if (is_swap_pte(*pte)) {
			read_swap_cache_async(...);
		}
	}
	blk_finish_plug(...);


If you do make a change such as this, please benchmark its effects.
Not on SSD ;)
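For concreteness, one possible open-coded shape of that suggestion, written against the v4.1-era tree, might look like the sketch below. The helper name is hypothetical, and the gfp choice and refcount handling are assumptions; it is not compile-tested. The point is that all swap reads are queued under one block plug, so a spinning disk sees a single batch instead of up to HPAGE_PMD_NR synchronous round trips.

```c
/*
 * Hypothetical sketch: queue every swap read first, wait for none.
 * Names and signatures taken from the v4.1-era tree; not compile-tested.
 */
static void __collapse_huge_page_swapin_readahead(struct mm_struct *mm,
				struct vm_area_struct *vma,
				unsigned long address, pmd_t *pmd)
{
	struct blk_plug plug;
	unsigned long _address;
	pte_t *pte = pte_offset_map(pmd, address);

	blk_start_plug(&plug);
	for (_address = address;
	     _address < address + HPAGE_PMD_NR * PAGE_SIZE;
	     pte++, _address += PAGE_SIZE) {
		pte_t pteval = *pte;

		if (is_swap_pte(pteval)) {
			/* Start the read; do not wait for the page. */
			struct page *page =
				read_swap_cache_async(pte_to_swp_entry(pteval),
						GFP_HIGHUSER_MOVABLE,
						vma, _address);
			/* Drop our reference; the page stays in the
			 * swap cache while its I/O completes. */
			if (page)
				page_cache_release(page);
		}
	}
	pte_unmap(pte - 1);
	blk_finish_plug(&plug);	/* submit the whole batch at once */
}
```

A later pass (or do_swap_page() itself) would then find the pages already in the swap cache and map them without blocking on the disk.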

2015-06-17 03:20:38

by Rik van Riel

Subject: Re: [RFC 3/3] mm: make swapin readahead to improve thp collapse rate

On 06/16/2015 05:15 PM, Andrew Morton wrote:
> On Sun, 14 Jun 2015 18:04:43 +0300 Ebru Akagunduz <[email protected]> wrote:
>
>> This patch adds swapin readahead to improve the THP collapse
>> rate. When khugepaged scans pages, a few of the pages in the
>> range may be in the swap area.
>>
>> With the patch THP can collapse 4kB pages into a THP when
>> there are up to max_ptes_swap swap ptes in a 2MB range.
>>
>> The patch was tested with a test program that allocates
>> 800MB of memory, writes to it, and then sleeps. I force
>> the system to swap everything out. Afterwards, the test
>> program touches the area by writing, skipping one page in
>> every 20 pages of the area.
>>
>> Without the patch, the system did no swapin readahead.
>> The THP rate was 47% of the program's memory, and it
>> did not change over time.
>>
>> With this patch, after 10 minutes of waiting khugepaged had
>> collapsed 99% of the program's memory.
>>
>> ...
>>
>> +/*
>> + * Bring missing pages in from swap, to complete THP collapse.
>> + * Only done if khugepaged_scan_pmd believes it is worthwhile.
>> + *
>> + * Called and returns without pte mapped or spinlocks held,
>> + * but with mmap_sem held to protect against vma changes.
>> + */
>> +
>> +static void __collapse_huge_page_swapin(struct mm_struct *mm,
>> +					struct vm_area_struct *vma,
>> +					unsigned long address, pmd_t *pmd,
>> +					pte_t *pte)
>> +{
>> +	unsigned long _address;
>> +	pte_t pteval = *pte;
>> +	int swap_pte = 0;
>> +
>> +	pte = pte_offset_map(pmd, address);
>> +	for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
>> +	     pte++, _address += PAGE_SIZE) {
>> +		pteval = *pte;
>> +		if (is_swap_pte(pteval)) {
>> +			swap_pte++;
>> +			do_swap_page(mm, vma, _address, pte, pmd, 0x0, pteval);
>> +			/* pte is unmapped now, we need to map it */
>> +			pte = pte_offset_map(pmd, _address);
>> +		}
>> +	}
>> +	pte--;
>> +	pte_unmap(pte);
>> +	trace_mm_collapse_huge_page_swapin(mm, vma->vm_start, swap_pte);
>> +}
>
> This is doing a series of synchronous reads. That will be sloooow on
> spinning disks.
>
> This function should be significantly faster if it first gets all the
> necessary I/O underway. I don't think we have a function which exactly
> does this. Perhaps generalise swapin_readahead() or open-code
> something like

Looking at do_swap_page() and __lock_page_or_retry(), I guess
there already is a way to do the above.

Passing a "flags" of FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT
to do_swap_page() should result in do_swap_page() returning with
the pte unmapped and the mmap_sem still held if the page was not
immediately available to map into the pte (i.e. if trylock_page fails).

Ebru, can you try passing the above as the flags argument to
do_swap_page(), and see what happens?
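Spelled out, the suggested experiment is a one-argument tweak inside the loop quoted above; the following is an untested sketch against the v4.1-era do_swap_page() signature:

```c
		if (is_swap_pte(pteval)) {
			swap_pte++;
			/* FAULT_FLAG_RETRY_NOWAIT: start the swap I/O
			 * but return immediately if the page is not yet
			 * ready, instead of sleeping on each page. */
			do_swap_page(mm, vma, _address, pte, pmd,
				     FAULT_FLAG_ALLOW_RETRY |
				     FAULT_FLAG_RETRY_NOWAIT,
				     pteval);
			/* pte is unmapped now, we need to map it */
			pte = pte_offset_map(pmd, _address);
		}
```

With that, the loop itself would double as the readahead pass: each iteration queues I/O and moves on, and pages still in flight would simply be picked up on a later khugepaged scan.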

--
All rights reversed

2015-06-17 17:39:11

by Ebru Akagunduz

Subject: Re: [RFC 3/3] mm: make swapin readahead to improve thp collapse rate

On Tue, Jun 16, 2015 at 11:20:20PM -0400, Rik van Riel wrote:
> On 06/16/2015 05:15 PM, Andrew Morton wrote:
> > On Sun, 14 Jun 2015 18:04:43 +0300 Ebru Akagunduz <[email protected]> wrote:
> >
> >> This patch adds swapin readahead to improve the THP collapse
> >> rate. When khugepaged scans pages, a few of the pages in the
> >> range may be in the swap area.
> >>
> >> With the patch THP can collapse 4kB pages into a THP when
> >> there are up to max_ptes_swap swap ptes in a 2MB range.
> >>
> >> The patch was tested with a test program that allocates
> >> 800MB of memory, writes to it, and then sleeps. I force
> >> the system to swap everything out. Afterwards, the test
> >> program touches the area by writing, skipping one page in
> >> every 20 pages of the area.
> >>
> >> Without the patch, the system did no swapin readahead.
> >> The THP rate was 47% of the program's memory, and it
> >> did not change over time.
> >>
> >> With this patch, after 10 minutes of waiting khugepaged had
> >> collapsed 99% of the program's memory.
> >>
> >> ...
> >>
> >> +/*
> >> + * Bring missing pages in from swap, to complete THP collapse.
> >> + * Only done if khugepaged_scan_pmd believes it is worthwhile.
> >> + *
> >> + * Called and returns without pte mapped or spinlocks held,
> >> + * but with mmap_sem held to protect against vma changes.
> >> + */
> >> +
> >> +static void __collapse_huge_page_swapin(struct mm_struct *mm,
> >> +					struct vm_area_struct *vma,
> >> +					unsigned long address, pmd_t *pmd,
> >> +					pte_t *pte)
> >> +{
> >> +	unsigned long _address;
> >> +	pte_t pteval = *pte;
> >> +	int swap_pte = 0;
> >> +
> >> +	pte = pte_offset_map(pmd, address);
> >> +	for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
> >> +	     pte++, _address += PAGE_SIZE) {
> >> +		pteval = *pte;
> >> +		if (is_swap_pte(pteval)) {
> >> +			swap_pte++;
> >> +			do_swap_page(mm, vma, _address, pte, pmd, 0x0, pteval);
> >> +			/* pte is unmapped now, we need to map it */
> >> +			pte = pte_offset_map(pmd, _address);
> >> +		}
> >> +	}
> >> +	pte--;
> >> +	pte_unmap(pte);
> >> +	trace_mm_collapse_huge_page_swapin(mm, vma->vm_start, swap_pte);
> >> +}
> >
> > This is doing a series of synchronous reads. That will be sloooow on
> > spinning disks.
> >
> > This function should be significantly faster if it first gets all the
> > necessary I/O underway. I don't think we have a function which exactly
> > does this. Perhaps generalise swapin_readahead() or open-code
> > something like
>
> Looking at do_swap_page() and __lock_page_or_retry(), I guess
> there already is a way to do the above.
>
> Passing a "flags" of FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT
> to do_swap_page() should result in do_swap_page() returning with
> the pte unmapped and the mmap_sem still held if the page was not
> immediately available to map into the pte (trylock_page succeeds).
>
> Ebru, can you try passing the above as the flags argument to
> do_swap_page(), and see what happens?

I will try it and resend the patch series.

Thanks for the suggestions.

Ebru