This patch series avoids unnecessary resource consumption
in khugepaged swapin and introduces a new function to
calculate the value of a specific vm event.
Ebru Akagunduz (2):
mm, vmstat: calculate particular vm event
mm, thp: avoid unnecessary swapin in khugepaged
include/linux/vmstat.h | 6 ++++++
mm/huge_memory.c | 13 +++++++++++--
mm/vmstat.c | 12 ++++++++++++
3 files changed, 29 insertions(+), 2 deletions(-)
--
1.9.1
Currently, vmstat can calculate a specific vm event with all_vm_events(),
but it has to copy all vm events onto the stack to do so. This patch
introduces a helper to sum the value of a specific vm event over all
CPUs, without loading all the events.
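For illustration (not part of the patch), a caller that only needs a
single counter, e.g. ALLOCSTALL, no longer has to snapshot the whole
array:

    /* old way: copy every counter to the stack just to read one */
    unsigned long events[NR_VM_EVENT_ITEMS];
    unsigned long allocstall;

    all_vm_events(events);
    allocstall = events[ALLOCSTALL];

    /* with this patch: sum only the wanted counter over all CPUs */
    allocstall = sum_vm_event(ALLOCSTALL);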
Signed-off-by: Ebru Akagunduz <[email protected]>
Suggested-by: Kirill A. Shutemov <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
Changes in v2:
- this patch is new in this version
- create a sum event function to
calculate a particular vm event (Kirill A. Shutemov)
Changes in v3:
- add a dummy definition of sum_vm_event
when CONFIG_VM_EVENT_COUNTERS is not set
(Kirill A. Shutemov)
Changes in v4:
- add Suggested-by tag (Vlastimil Babka)
include/linux/vmstat.h | 6 ++++++
mm/vmstat.c | 12 ++++++++++++
2 files changed, 18 insertions(+)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 73fae8c..e5ec287 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -53,6 +53,8 @@ static inline void count_vm_events(enum vm_event_item item, long delta)
extern void all_vm_events(unsigned long *);
+extern unsigned long sum_vm_event(enum vm_event_item item);
+
extern void vm_events_fold_cpu(int cpu);
#else
@@ -73,6 +75,10 @@ static inline void __count_vm_events(enum vm_event_item item, long delta)
static inline void all_vm_events(unsigned long *ret)
{
}
+static inline unsigned long sum_vm_event(enum vm_event_item item)
+{
+ return 0;
+}
static inline void vm_events_fold_cpu(int cpu)
{
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 5e43004..b76d664 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -34,6 +34,18 @@
DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
EXPORT_PER_CPU_SYMBOL(vm_event_states);
+unsigned long sum_vm_event(enum vm_event_item item)
+{
+ int cpu;
+ unsigned long ret = 0;
+
+ get_online_cpus();
+ for_each_online_cpu(cpu)
+ ret += per_cpu(vm_event_states, cpu).event[item];
+ put_online_cpus();
+ return ret;
+}
+
static void sum_vm_events(unsigned long *ret)
{
int cpu;
--
1.9.1
Currently khugepaged does swapin readahead to improve
the THP collapse rate. This patch checks vm statistics
to skip the swapin when it is unnecessary, so that when
the system is under pressure, khugepaged won't consume
resources on swapin.
The patch was tested with a test program that allocates
800MB of memory, writes to it, and then sleeps. The system
was forced to swap it all out. Afterwards, the test program
touches the area by writing again, skipping one page in each
20 pages of the area. During the part of the test that waits
for the swapin readahead, memory was forced to be busy doing
page reclaim. Although there was enough free memory during
the test, khugepaged did not do swapin readahead because the
system was busy.
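A rough sketch of such a test program (reconstructed from the description
above; the sleep durations and the exact loop are assumptions, not the
original test code):

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define SIZE (800UL << 20)  /* 800MB */

    int main(void)
    {
            long page = sysconf(_SC_PAGESIZE);
            char *area = malloc(SIZE);
            unsigned long i;

            memset(area, 1, SIZE);  /* write to the whole area */
            sleep(600);             /* let the system swap it all out */

            /* touch the area again, skipping one page in each 20 */
            for (i = 0; i < SIZE; i += page) {
                    if ((i / page) % 20 == 0)
                            continue;
                    area[i] = 2;
            }
            sleep(600);             /* give khugepaged time to collapse */
            return 0;
    }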
Test results:
After swapped out
-------------------------------------------------------------------
| Anonymous | AnonHugePages | Swap | Fraction |
-------------------------------------------------------------------
With patch | 217888 kB | 217088 kB | 582112 kB | %99 |
-------------------------------------------------------------------
Without patch | 351308 kB | 350208 kB | 448692 kB | %99 |
-------------------------------------------------------------------
After swapped in (waiting 10 minutes)
-------------------------------------------------------------------
| Anonymous | AnonHugePages | Swap | Fraction |
-------------------------------------------------------------------
With patch | 604440 kB | 348160 kB | 195560 kB | %57 |
-------------------------------------------------------------------
Without patch | 586816 kB | 464896 kB | 213184 kB | %79 |
-------------------------------------------------------------------
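(Fraction is not defined above; the numbers are consistent with it being
AnonHugePages / Anonymous, e.g. 217088 kB / 217888 kB is about 99% and
348160 kB / 604440 kB is about 57%.)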
Signed-off-by: Ebru Akagunduz <[email protected]>
Fixes: 363cd76e5b11c ("mm: make swapin readahead to improve thp collapse rate")
---
Changes in v2:
- Add a reference to specify which patch is being fixed (Ebru Akagunduz)
- Fix commit subject line (Ebru Akagunduz)
Changes in v3:
- Remove default values of allocstall (Kirill A. Shutemov)
Changes in v4:
- define unsigned long allocstall instead of unsigned long int
(Vlastimil Babka)
- compare allocstall when khugepaged goes to sleep
(Rik van Riel, Vlastimil Babka)
mm/huge_memory.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 86e9666..104faa1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -102,6 +102,7 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
*/
static unsigned int khugepaged_max_ptes_none __read_mostly;
static unsigned int khugepaged_max_ptes_swap __read_mostly;
+static unsigned long allocstall;
static int khugepaged(void *none);
static int khugepaged_slab_init(void);
@@ -2438,7 +2439,7 @@ static void collapse_huge_page(struct mm_struct *mm,
struct page *new_page;
spinlock_t *pmd_ptl, *pte_ptl;
int isolated = 0, result = 0;
- unsigned long hstart, hend;
+ unsigned long hstart, hend, swap, curr_allocstall;
struct mem_cgroup *memcg;
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */
@@ -2493,7 +2494,14 @@ static void collapse_huge_page(struct mm_struct *mm,
goto out;
}
- __collapse_huge_page_swapin(mm, vma, address, pmd);
+ swap = get_mm_counter(mm, MM_SWAPENTS);
+ curr_allocstall = sum_vm_event(ALLOCSTALL);
+ /*
+ * When the system is under pressure, don't do swapin readahead,
+ * to avoid unnecessary resource consumption.
+ */
+ if (allocstall == curr_allocstall && swap != 0)
+ __collapse_huge_page_swapin(mm, vma, address, pmd);
anon_vma_lock_write(vma->anon_vma);
@@ -2905,6 +2913,7 @@ static int khugepaged(void *none)
set_user_nice(current, MAX_NICE);
while (!kthread_should_stop()) {
+ allocstall = sum_vm_event(ALLOCSTALL);
khugepaged_do_scan();
khugepaged_wait_work();
}
--
1.9.1
[I am sorry I haven't responded sooner but I was busy with other stuff]
On Sun 20-03-16 20:07:39, Ebru Akagunduz wrote:
> Currently khugepaged does swapin readahead to improve
> the THP collapse rate. This patch checks vm statistics
> to skip the swapin when it is unnecessary, so that when
> the system is under pressure, khugepaged won't consume
> resources on swapin.
OK, so you want to disable the optimization when under the memory
pressure. That sounds like a good idea in general.
> The patch was tested with a test program that allocates
> 800MB of memory, writes to it, and then sleeps. The system
> was forced to swap it all out. Afterwards, the test program
> touches the area by writing again, skipping one page in each
> 20 pages of the area. During the part of the test that waits
> for the swapin readahead, memory was forced to be busy doing
> page reclaim. Although there was enough free memory during
> the test, khugepaged did not do swapin readahead because the
> system was busy.
>
> Test results:
>
> After swapped out
> -------------------------------------------------------------------
> | Anonymous | AnonHugePages | Swap | Fraction |
> -------------------------------------------------------------------
> With patch | 217888 kB | 217088 kB | 582112 kB | %99 |
> -------------------------------------------------------------------
> Without patch | 351308 kB | 350208 kB | 448692 kB | %99 |
> -------------------------------------------------------------------
>
> After swapped in (waiting 10 minutes)
> -------------------------------------------------------------------
> | Anonymous | AnonHugePages | Swap | Fraction |
> -------------------------------------------------------------------
> With patch | 604440 kB | 348160 kB | 195560 kB | %57 |
> -------------------------------------------------------------------
> Without patch | 586816 kB | 464896 kB | 213184 kB | %79 |
> -------------------------------------------------------------------
I am not really sure I understand these results. The system indeed
swapped in much less but how come the Fraction is much higher when
__collapse_huge_page_swapin was called less?
> Signed-off-by: Ebru Akagunduz <[email protected]>
> Fixes: 363cd76e5b11c ("mm: make swapin readahead to improve thp collapse rate")
This doesn't exist in the Linus tree. So I guess this is a reference to
linux-next. If that is the case then just drop the Fixes part as the sha
is not stable and this will become confusing later on.
[...]
> @@ -2438,7 +2439,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> struct page *new_page;
> spinlock_t *pmd_ptl, *pte_ptl;
> int isolated = 0, result = 0;
> - unsigned long hstart, hend;
> + unsigned long hstart, hend, swap, curr_allocstall;
> struct mem_cgroup *memcg;
> unsigned long mmun_start; /* For mmu_notifiers */
> unsigned long mmun_end; /* For mmu_notifiers */
> @@ -2493,7 +2494,14 @@ static void collapse_huge_page(struct mm_struct *mm,
> goto out;
> }
>
> - __collapse_huge_page_swapin(mm, vma, address, pmd);
> + swap = get_mm_counter(mm, MM_SWAPENTS);
> + curr_allocstall = sum_vm_event(ALLOCSTALL);
> + /*
> + * When the system is under pressure, don't do swapin readahead,
> + * to avoid unnecessary resource consumption.
> + */
> + if (allocstall == curr_allocstall && swap != 0)
> + __collapse_huge_page_swapin(mm, vma, address, pmd);
this criterion doesn't really make much sense to me. So we are checking
whether direct reclaim was invoked since some point in time
(more on that below) and we take that as a signal of strong memory
pressure, right? What if that was quite some time ago? What if we didn't
have a single direct reclaim but kswapd was busy the whole time? Or
what if the allocstall was from a different NUMA node?
>
> anon_vma_lock_write(vma->anon_vma);
>
> @@ -2905,6 +2913,7 @@ static int khugepaged(void *none)
> set_user_nice(current, MAX_NICE);
>
> while (!kthread_should_stop()) {
> + allocstall = sum_vm_event(ALLOCSTALL);
> khugepaged_do_scan();
And this sounds even buggy AFAIU. I guess you want to snapshot before
going to sleep, no? Otherwise you are comparing the allocstall diff from a
very short time period. Or was this the intention and you really want to
watch for events while khugepaged is running? If yes, a comment would be
due here.
> khugepaged_wait_work();
> }
That being said, is this actually useful in the real life? Basing your
decision on something as volatile as the direct reclaim would lead to
rather volatile results. E.g. how stable are the numbers during your
test?
Wouldn't it be better to rather do an optimistic swapin and back out
if the direct reclaim is really required? I realize this will be a much
bigger change but it would make more sense I guess.
--
Michal Hocko
SUSE Labs
On Mon, 2016-03-21 at 16:36 +0100, Michal Hocko wrote:
> On Sun 20-03-16 20:07:39, Ebru Akagunduz wrote:
> >
> > Currently khugepaged does swapin readahead to improve
> > the THP collapse rate. This patch checks vm statistics
> > to skip the swapin when it is unnecessary, so that when
> > the system is under pressure, khugepaged won't consume
> > resources on swapin.
> OK, so you want to disable the optimization when under the memory
> pressure. That sounds like a good idea in general.
>
> > @@ -2493,7 +2494,14 @@ static void collapse_huge_page(struct
> > mm_struct *mm,
> > goto out;
> > }
> >
> > - __collapse_huge_page_swapin(mm, vma, address, pmd);
> > + swap = get_mm_counter(mm, MM_SWAPENTS);
> > + curr_allocstall = sum_vm_event(ALLOCSTALL);
> > + /*
> > + * When the system is under pressure, don't do swapin readahead,
> > + * to avoid unnecessary resource consumption.
> > + */
> > + if (allocstall == curr_allocstall && swap != 0)
> > + __collapse_huge_page_swapin(mm, vma, address,
> > pmd);
> > this criterion doesn't really make much sense to me. So we are
> > checking whether direct reclaim was invoked since some point in time
> > (more on that below) and we take that as a signal of strong memory
> > pressure, right? What if that was quite some time ago? What if we
> > didn't have a single direct reclaim but kswapd was busy the whole
> > time? Or what if the allocstall was from a different NUMA node?
Do you have a measure in mind that the code should test
against, instead?
I don't think we want page cache turnover to prevent
khugepaged collapsing THPs, but if the system gets
to the point where kswapd is doing pageout IO, or
swapout IO, or kswapd cannot keep up, we should
probably slow down khugepaged.
If another NUMA node is under significant memory
pressure, we probably want the programs from that
node to be able to do some allocations from this
node, rather than have khugepaged consume the memory.
> > anon_vma_lock_write(vma->anon_vma);
> >
> > @@ -2905,6 +2913,7 @@ static int khugepaged(void *none)
> > set_user_nice(current, MAX_NICE);
> >
> > while (!kthread_should_stop()) {
> > + allocstall = sum_vm_event(ALLOCSTALL);
> > khugepaged_do_scan();
> And this sounds even buggy AFAIU. I guess you want to snapshot before
> going to sleep, no? Otherwise you are comparing the allocstall diff
> from a very short time period. Or was this the intention and you really
> want to watch for events while khugepaged is running? If yes, a comment
> would be due here.
You are right, the snapshot should be taken after
khugepaged_do_scan().
The memory pressure needs to be measured over the
longest time possible between khugepaged runs.
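A minimal sketch of that ordering (illustration only, not a posted patch):

    while (!kthread_should_stop()) {
            khugepaged_do_scan();
            /*
             * Snapshot ALLOCSTALL right before going to sleep, so the
             * next scan compares against the whole sleep period instead
             * of only the short scan window.
             */
            allocstall = sum_vm_event(ALLOCSTALL);
            khugepaged_wait_work();
    }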
> That being said, is this actually useful in the real life? Basing
> your
> decision on something as volatile as the direct reclaim would lead to
> rather volatile results. E.g. how stable are the numbers during your
> test?
>
> Wouldn't it be better to rather do an optimistic swapin and back out
> if the direct reclaim is really required? I realize this will be a
> much
> bigger change but it would make more sense I guess.
That depends on how costly swap IO is.
Having khugepaged be on the conservative side is probably
a good idea, given how many systems out there still have
hard drives today.
--
All Rights Reversed.
On Tue 22-03-16 15:21:16, Rik van Riel wrote:
> On Mon, 2016-03-21 at 16:36 +0100, Michal Hocko wrote:
> > On Sun 20-03-16 20:07:39, Ebru Akagunduz wrote:
> > >
> > > Currently khugepaged does swapin readahead to improve
> > > the THP collapse rate. This patch checks vm statistics
> > > to skip the swapin when it is unnecessary, so that when
> > > the system is under pressure, khugepaged won't consume
> > > resources on swapin.
> > OK, so you want to disable the optimization when under the memory
> > pressure. That sounds like a good idea in general.
> >
> > > @@ -2493,7 +2494,14 @@ static void collapse_huge_page(struct
> > > mm_struct *mm,
> > > goto out;
> > > }
> > >
> > > - __collapse_huge_page_swapin(mm, vma, address, pmd);
> > > + swap = get_mm_counter(mm, MM_SWAPENTS);
> > > + curr_allocstall = sum_vm_event(ALLOCSTALL);
> > > + /*
> > > + * When the system is under pressure, don't do swapin readahead,
> > > + * to avoid unnecessary resource consumption.
> > > + */
> > > + if (allocstall == curr_allocstall && swap != 0)
> > > + __collapse_huge_page_swapin(mm, vma, address,
> > > pmd);
> > this criterion doesn't really make much sense to me. So we are
> > checking whether direct reclaim was invoked since some point in time
> > (more on that below) and we take that as a signal of strong memory
> > pressure, right? What if that was quite some time ago? What if we
> > didn't have a single direct reclaim but kswapd was busy the whole
> > time? Or what if the allocstall was from a different NUMA node?
>
> Do you have a measure in mind that the code should test
> against, instead?
vmpressure provides a reclaim pressure feedback. I am not sure it could
be used here, though.
> I don't think we want page cache turnover to prevent
> khugepaged collapsing THPs, but if the system gets
> to the point where kswapd is doing pageout IO, or
> swapout IO, or kswapd cannot keep up, we should
> probably slow down khugepaged.
I agree. Would using gfp_mask & ~___GFP_DIRECT_RECLAIM allocation
requests for the opportunistic swapin be something to try out? If the
kswapd doesn't keep up with the load to the point when we have to enter
the direct reclaim, then it doesn't really make sense to increase the
memory pressure by additional direct reclaim.
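A very rough sketch of that idea (assumption for illustration only: the
swapin helper would have to grow a gfp_t parameter that it passes down to
the swap readahead allocations, which the posted patch does not have):

    /* opportunistic swapin: may wake kswapd, never stalls in direct reclaim */
    gfp_t gfp = GFP_HIGHUSER_MOVABLE & ~__GFP_DIRECT_RECLAIM;

    __collapse_huge_page_swapin(mm, vma, address, pmd, gfp);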
> If another NUMA node is under significant memory
> pressure, we probably want the programs from that
> node to be able to do some allocations from this
> node, rather than have khugepaged consume the memory.
This is hard to tell because those tasks might be bound to that node
and won't leave it. Anyway I just wanted to point out that relying on
a global counter is rather dubious.
--
Michal Hocko
SUSE Labs