2020-04-12 09:18:01

by Liang Li

Subject: [RFC PATCH 4/4] mm: Add PG_zero support

Zeroing out page contents usually happens at allocation time. It is
a time-consuming operation and makes pin and mlock operations very
slow, especially for a large batch of memory.

This patch introduces a new feature that zeroes out pages before
page allocation, which helps to speed up page allocation.

The idea is simple: zero out free pages when the system is not busy
and mark them with PG_zero. When a page is allocated and needs to be
filled with zeros, check the flag in struct page; if PG_zero is set,
the zeroing can be skipped, which saves CPU time and speeds up page
allocation.
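Roughly, the two halves of the mechanism look like the sketch below
(the helper names are only illustrative; the real hooks are in the
diff that follows):

/* Background side, driven by free page reporting: zero a reported
 * free page once and remember that in the new page flag. */
static void prezero_one_page(struct page *page)
{
	if (!PageZero(page)) {
		void *kaddr = kmap_atomic(page);

		clear_page(kaddr);
		kunmap_atomic(kaddr);
		SetPageZero(page);
	}
}

/* Allocation side: the clear can be skipped if the flag is still set. */
static void clear_page_on_alloc(struct page *page)
{
	void *kaddr;

	if (TestClearPageZero(page))
		return;		/* content is already zero, skip the clear */

	kaddr = kmap_atomic(page);
	clear_page(kaddr);
	kunmap_atomic(kaddr);
}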

This series is based on the 'free page reporting' feature
introduced by Alexander Duyck.

We can benefit from this feature in the following cases:
1. User space mlocks a large chunk of memory
2. VFIO pins pages for DMA
3. Allocating transparent huge pages
4. Speeding up the page fault path

My original motivation for adding this feature was to shorten VM
creation time when a VFIO device is attached; it works well and
reduces VM creation time noticeably.
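The feature is controlled at run time through the sysfs attribute
created below (kobject "zero_page" under mm_kobj, attribute
"enabled"); a minimal userspace sketch for toggling it:

#include <stdio.h>

/* Toggle background pre-zeroing via /sys/kernel/mm/zero_page/enabled,
 * the knob added by this patch. Sketch only; error handling is minimal. */
static int set_prezeroing(int enable)
{
	FILE *f = fopen("/sys/kernel/mm/zero_page/enabled", "w");

	if (!f)
		return -1;
	fprintf(f, "%d\n", enable ? 1 : 0);
	return fclose(f);
}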

Cc: Alexander Duyck <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Alex Williamson <[email protected]>
Signed-off-by: liliangleo <[email protected]>
---
include/linux/highmem.h | 31 ++++++++-
include/linux/page-flags.h | 18 ++++-
include/trace/events/mmflags.h | 7 ++
mm/Kconfig | 10 +++
mm/Makefile | 1 +
mm/huge_memory.c | 3 +-
mm/page_alloc.c | 2 +
mm/zero_page.c | 151 +++++++++++++++++++++++++++++++++++++++++
mm/zero_page.h | 13 ++++
9 files changed, 231 insertions(+), 5 deletions(-)
create mode 100644 mm/zero_page.c
create mode 100644 mm/zero_page.h

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index ea5cdbd8c2c3..0308837adc19 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -157,7 +157,13 @@ do { \
#ifndef clear_user_highpage
static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
{
- void *addr = kmap_atomic(page);
+ void *addr;
+
+#ifdef CONFIG_ZERO_PAGE
+ if (TestClearPageZero(page))
+ return;
+#endif
+ addr = kmap_atomic(page);
clear_user_page(addr, vaddr, page);
kunmap_atomic(addr);
}
@@ -208,9 +214,30 @@ alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
return __alloc_zeroed_user_highpage(__GFP_MOVABLE, vma, vaddr);
}

+#ifdef CONFIG_ZERO_PAGE
+static inline void __clear_highpage(struct page *page)
+{
+ void *kaddr;
+
+ if (PageZero(page))
+ return;
+
+ kaddr = kmap_atomic(page);
+ clear_page(kaddr);
+ SetPageZero(page);
+ kunmap_atomic(kaddr);
+}
+#endif
+
static inline void clear_highpage(struct page *page)
{
- void *kaddr = kmap_atomic(page);
+ void *kaddr;
+
+#ifdef CONFIG_ZERO_PAGE
+ if (TestClearPageZero(page))
+ return;
+#endif
+ kaddr = kmap_atomic(page);
clear_page(kaddr);
kunmap_atomic(kaddr);
}
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 222f6f7b2bb3..ace247c5d3ec 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -136,6 +136,10 @@ enum pageflags {
PG_young,
PG_idle,
#endif
+#ifdef CONFIG_ZERO_PAGE
+ PG_zero,
+#endif
+
__NR_PAGEFLAGS,

/* Filesystems */
@@ -447,6 +451,16 @@ PAGEFLAG(Idle, idle, PF_ANY)
*/
__PAGEFLAG(Reported, reported, PF_NO_COMPOUND)

+#ifdef CONFIG_ZERO_PAGE
+PAGEFLAG(Zero, zero, PF_ANY)
+TESTSCFLAG(Zero, zero, PF_ANY)
+#define __PG_ZERO (1UL << PG_zero)
+#else
+PAGEFLAG_FALSE(Zero)
+#define __PG_ZERO 0
+#endif
+
+
/*
* On an anonymous page mapped into a user virtual memory area,
* page->mapping points to its anon_vma, not to a struct address_space;
@@ -843,7 +857,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
1UL << PG_private | 1UL << PG_private_2 | \
1UL << PG_writeback | 1UL << PG_reserved | \
1UL << PG_slab | 1UL << PG_active | \
- 1UL << PG_unevictable | __PG_MLOCKED)
+ 1UL << PG_unevictable | __PG_MLOCKED | __PG_ZERO)

/*
* Flags checked when a page is prepped for return by the page allocator.
@@ -854,7 +868,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
* alloc-free cycle to prevent from reusing the page.
*/
#define PAGE_FLAGS_CHECK_AT_PREP \
- (((1UL << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON)
+ (((1UL << NR_PAGEFLAGS) - 1) & ~(__PG_HWPOISON | __PG_ZERO))

#define PAGE_FLAGS_PRIVATE \
(1UL << PG_private | 1UL << PG_private_2)
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 5fb752034386..7be4153bed2c 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -73,6 +73,12 @@
#define IF_HAVE_PG_HWPOISON(flag,string)
#endif

+#ifdef CONFIG_ZERO_PAGE
+#define IF_HAVE_PG_ZERO(flag,string) ,{1UL << flag, string}
+#else
+#define IF_HAVE_PG_ZERO(flag,string)
+#endif
+
#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
#define IF_HAVE_PG_IDLE(flag,string) ,{1UL << flag, string}
#else
@@ -104,6 +110,7 @@
IF_HAVE_PG_MLOCK(PG_mlocked, "mlocked" ) \
IF_HAVE_PG_UNCACHED(PG_uncached, "uncached" ) \
IF_HAVE_PG_HWPOISON(PG_hwpoison, "hwpoison" ) \
+IF_HAVE_PG_ZERO(PG_zero, "zero" ) \
IF_HAVE_PG_IDLE(PG_young, "young" ) \
IF_HAVE_PG_IDLE(PG_idle, "idle" )

diff --git a/mm/Kconfig b/mm/Kconfig
index c1acc34c1c35..3806bdbff4c9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -252,6 +252,16 @@ config PAGE_REPORTING
those pages to another entity, such as a hypervisor, so that the
memory can be freed within the host for other uses.

+#
+# support for zero free page
+config ZERO_PAGE
+ bool "Zero free page"
+ default y
+ depends on PAGE_REPORTING
+ help
+ Zero out free pages in the free lists in the background, based
+ on the free page reporting infrastructure.
+
#
# support for page migration
#
diff --git a/mm/Makefile b/mm/Makefile
index fccd3756b25f..ee23147a623f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -112,3 +112,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_ZERO_PAGE) += zero_page.o
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6ecd1045113b..a28707aea3c5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2542,7 +2542,8 @@ static void __split_huge_page_tail(struct page *head, int tail,
(1L << PG_workingset) |
(1L << PG_locked) |
(1L << PG_unevictable) |
- (1L << PG_dirty)));
+ (1L << PG_dirty) |
+ __PG_ZERO));

/* ->mapping in first tail page is compound_mapcount */
VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 69827d4fa052..3e9601d0b944 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -75,6 +75,7 @@
#include "internal.h"
#include "shuffle.h"
#include "page_reporting.h"
+#include "zero_page.h"

/* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
static DEFINE_MUTEX(pcp_batch_high_lock);
@@ -1179,6 +1180,7 @@ static __always_inline bool free_pages_prepare(struct page *page,

trace_mm_page_free(page, order);

+ clear_zero_page_flag(page, order);
/*
* Check tail pages before head page information is cleared to
* avoid checking PageCompound for order-0 pages.
diff --git a/mm/zero_page.c b/mm/zero_page.c
new file mode 100644
index 000000000000..f3b3d58f0ef2
--- /dev/null
+++ b/mm/zero_page.c
@@ -0,0 +1,151 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Copyright (C) 2020 Didi Chuxing.
+ *
+ * Authors: Liang Li <[email protected]>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/mm.h>
+#include <linux/highmem.h>
+#include <linux/rmap.h>
+#include <linux/mm_inline.h>
+#include <linux/page_reporting.h>
+#include "internal.h"
+#include "zero_page.h"
+
+#define ZERO_PAGE_STOP 0
+#define ZERO_PAGE_RUN 1
+
+static unsigned long zeropage_enable __read_mostly;
+static DEFINE_MUTEX(kzeropaged_mutex);
+static struct page_reporting_dev_info zero_page_dev_info;
+
+void clear_zero_page_flag(struct page *page, int order)
+{
+ int i;
+
+ for (i = 0; i < (1 << order); i++)
+ ClearPageZero(page + i);
+}
+
+static int zero_free_pages(struct page_reporting_dev_info *pr_dev_info,
+ struct scatterlist *sgl, unsigned int nents)
+{
+ struct scatterlist *sg = sgl;
+
+ might_sleep();
+ do {
+ struct page *page = sg_page(sg);
+ unsigned int order = get_order(sg->length);
+ int i;
+
+ VM_BUG_ON(PageBuddy(page) || page_order(page));
+
+ for (i = 0; i < (1 << order); i++) {
+ cond_resched();
+ __clear_highpage(page + i);
+ }
+ } while ((sg = sg_next(sg)));
+
+ return 0;
+}
+
+static int start_kzeropaged(void)
+{
+ int err = 0;
+
+ if (zeropage_enable) {
+ zero_page_dev_info.report = zero_free_pages;
+ err = page_reporting_register(&zero_page_dev_info);
+ pr_info("Zero page enabled\n");
+ } else {
+ page_reporting_unregister(&zero_page_dev_info);
+ pr_info("Zero page disabled\n");
+ }
+
+ return err;
+}
+
+static ssize_t enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%lu\n", zeropage_enable);
+}
+
+static ssize_t enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ ssize_t ret = 0;
+ unsigned long flags;
+ int err;
+
+ err = kstrtoul(buf, 10, &flags);
+ if (err || flags > UINT_MAX)
+ return -EINVAL;
+ if (flags > ZERO_PAGE_RUN)
+ return -EINVAL;
+
+ if (zeropage_enable != flags) {
+ mutex_lock(&kzeropaged_mutex);
+ zeropage_enable = flags;
+ ret = start_kzeropaged();
+ mutex_unlock(&kzeropaged_mutex);
+ }
+
+ return count;
+}
+
+static struct kobj_attribute enabled_attr =
+ __ATTR(enabled, 0644, enabled_show, enabled_store);
+
+static struct attribute *zeropage_attr[] = {
+ &enabled_attr.attr,
+ NULL,
+};
+
+static struct attribute_group zeropage_attr_group = {
+ .attrs = zeropage_attr,
+};
+
+static int __init zeropage_init_sysfs(struct kobject **zeropage_kobj)
+{
+ int err;
+
+ *zeropage_kobj = kobject_create_and_add("zero_page", mm_kobj);
+ if (unlikely(!*zeropage_kobj)) {
+ pr_err("zeropage: failed to create zeropage kobject\n");
+ return -ENOMEM;
+ }
+
+ err = sysfs_create_group(*zeropage_kobj, &zeropage_attr_group);
+ if (err) {
+ pr_err("zeropage: failed to register zeropage group\n");
+ goto delete_obj;
+ }
+
+ return 0;
+
+delete_obj:
+ kobject_put(*zeropage_kobj);
+ return err;
+}
+
+static int __init zeropage_init(void)
+{
+ int err;
+ struct kobject *zeropage_kobj;
+
+ err = zeropage_init_sysfs(&zeropage_kobj);
+ if (err)
+ return err;
+
+ start_kzeropaged();
+
+ return 0;
+}
+subsys_initcall(zeropage_init);
diff --git a/mm/zero_page.h b/mm/zero_page.h
new file mode 100644
index 000000000000..bfa3c9fe94d3
--- /dev/null
+++ b/mm/zero_page.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_ZERO_PAGE_H
+#define _LINUX_ZERO_PAGE_H
+
+#ifdef CONFIG_ZERO_PAGE
+extern void clear_zero_page_flag(struct page *page, int order);
+#else
+static inline void clear_zero_page_flag(struct page *page, int order)
+{
+}
+#endif
+#endif /* _LINUX_ZERO_PAGE_H */
+
--
2.14.1


2020-04-12 10:13:26

by Matthew Wilcox

Subject: Re: [RFC PATCH 4/4] mm: Add PG_zero support

On Sun, Apr 12, 2020 at 05:09:49AM -0400, liliangleo wrote:
> Zeroing out page contents usually happens at allocation time. It is
> a time-consuming operation and makes pin and mlock operations very
> slow, especially for a large batch of memory.
>
> This patch introduces a new feature that zeroes out pages before
> page allocation, which helps to speed up page allocation.
>
> The idea is simple: zero out free pages when the system is not busy
> and mark them with PG_zero. When a page is allocated and needs to be
> filled with zeros, check the flag in struct page; if PG_zero is set,
> the zeroing can be skipped, which saves CPU time and speeds up page
> allocation.

We are very short on bits in the page flags. If we can implement this
feature without using another one, this would be good.

If the bit is only set on pages which are PageBuddy(), we can definitely
find space for it as an alias of another bit.
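
One way to picture the aliasing idea as a sketch (hypothetical code,
using PG_owner_priv_1 purely as an example of a bit that free pages
do not otherwise use):

/* The "pre-zeroed" state is only tracked while PageBuddy() is true,
 * so an existing bit can double up for it instead of a new pageflag. */
#define PG_prezeroed	PG_owner_priv_1		/* example alias only */

static inline bool page_prezeroed(struct page *page)
{
	return PageBuddy(page) && test_bit(PG_prezeroed, &page->flags);
}

static inline void set_page_prezeroed(struct page *page)
{
	VM_BUG_ON_PAGE(!PageBuddy(page), page);
	set_bit(PG_prezeroed, &page->flags);
}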

2020-04-14 13:38:41

by Andrew Morton

Subject: Re: [RFC PATCH 4/4] mm: Add PG_zero support

On Mon, 13 Apr 2020 08:11:59 -0700 Alexander Duyck <[email protected]> wrote:

> In addition, unlike madvising the page away there is a pretty
> significant performance penalty for having to clear the page a second
> time when the page is split or merged.

I wonder if there might be an issue with increased memory traffic (and
increased energy consumption, etc). If a page is zeroed immediately
before getting data written into it (eg, plain old file write(),
anonymous pagefault) then we can expect that those 4096 zeroes will be
in CPU cache and mostly not written back. But if that page was zeroed
a "long" time ago, the caches will probably have been written back.
Net result: we go from 4k of memory traffic for a 4k page up to 8k of
memory traffic?

Also, the name CONFIG_ZERO_PAGE sounds like it has something to do with
the long established "zero page". Confusing. CONFIG_PREZERO_PAGE,
maybe?

2020-04-14 18:24:20

by Alexander Duyck

Subject: Re: [RFC PATCH 4/4] mm: Add PG_zero support

On 4/12/2020 3:12 AM, Matthew Wilcox wrote:
> On Sun, Apr 12, 2020 at 05:09:49AM -0400, liliangleo wrote:
>> Zeroing out page contents usually happens at allocation time. It is
>> a time-consuming operation and makes pin and mlock operations very
>> slow, especially for a large batch of memory.
>>
>> This patch introduces a new feature that zeroes out pages before
>> page allocation, which helps to speed up page allocation.
>>
>> The idea is simple: zero out free pages when the system is not busy
>> and mark them with PG_zero. When a page is allocated and needs to be
>> filled with zeros, check the flag in struct page; if PG_zero is set,
>> the zeroing can be skipped, which saves CPU time and speeds up page
>> allocation.
>
> We are very short on bits in the page flags. If we can implement this
> feature without using another one, this would be good.
>
> If the bit is only set on pages which are PageBuddy(), we can definitely
> find space for it as an alias of another bit.


I had considered doing something similar several months back because one
of the side effects in the VM is that most of the pages appear to have
been zeroed by page reporting. However the problem is that in order to
handle the zeroing case you have to push the flag outside the PageBuddy
region, and you cannot guarantee that the page is even expected to be
zeroed since it might have been zeroed before it was freed, so it is
just adding work of having to clear an extra flag some time after
allocation.

In addition, unlike madvising the page away there is a pretty
significant performance penalty for having to clear the page a second
time when the page is split or merged.

2020-04-15 15:09:23

by Chen, Rong A

Subject: [mm] 5ae8a9d7c8: will-it-scale.per_thread_ops -2.1% regression

Greeting,

FYI, we noticed a -2.1% regression of will-it-scale.per_thread_ops due to commit:


commit: 5ae8a9d7c84e7e6fa64ccaa357a1351015f1457c ("[RFC PATCH 4/4] mm: Add PG_zero support")
url: https://github.com/0day-ci/linux/commits/liliangleo/mm-Add-PG_zero-support/20200412-172834


in testcase: will-it-scale
on test machine: 8 threads Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz with 16G memory
with following parameters:

nr_task: 100%
mode: thread
test: page_fault1
cpufreq_governor: performance
ucode: 0x21

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale



If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
gcc-7/performance/x86_64-rhel-7.6/thread/100%/debian-x86_64-20191114.cgz/lkp-ivb-d01/page_fault1/will-it-scale/0x21

commit:
0801ffd19f ("mm: add sys fs configuration for page reporting")
5ae8a9d7c8 ("mm: Add PG_zero support")

0801ffd19fa82207 5ae8a9d7c84e7e6fa64ccaa357a
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
:4 25% 1:4 dmesg.RIP:__mnt_want_write
:4 25% 1:4 dmesg.RIP:loop
1:4 -25% :4 dmesg.RIP:poll_idle
1:4 -25% :4 kmsg.b44449d>]usb_hcd_irq
:4 25% 1:4 kmsg.b5a1a>]usb_hcd_irq
1:4 -25% :4 kmsg.c3ed91f>]usb_hcd_irq
:4 25% 1:4 kmsg.c6fae9f>]usb_hcd_irq
1:4 -25% :4 kmsg.c8b7ca>]usb_hcd_irq
%stddev %change %stddev
\ | \
536081 -2.1% 524633 will-it-scale.per_thread_ops
2523521 -2.1% 2470098 will-it-scale.time.minor_page_faults
107.85 -4.2% 103.29 will-it-scale.time.user_time
511850 -3.3% 495188 will-it-scale.time.voluntary_context_switches
4288652 -2.1% 4197068 will-it-scale.workload
4.87 +0.8 5.63 mpstat.cpu.all.idle%
3991 +10.2% 4397 ± 5% slabinfo.anon_vma.num_objs
142485 ± 8% -15.3% 120695 ± 3% softirqs.CPU7.TIMER
3005147 ± 5% +129.0% 6881725 ± 28% cpuidle.C1.time
66503 ± 2% +55.0% 103053 ± 22% cpuidle.C1.usage
32502781 ± 14% +26.6% 41156255 cpuidle.C3.time
23839328 ± 16% +45.3% 34642724 ± 8% cpuidle.C6.time
41675 ± 15% +58.6% 66107 ± 10% cpuidle.C6.usage
246196 -4.0% 236260 interrupts.CAL:Function_call_interrupts
8406 ± 32% +27.0% 10673 ± 27% interrupts.CPU4.NMI:Non-maskable_interrupts
8406 ± 32% +27.0% 10673 ± 27% interrupts.CPU4.PMI:Performance_monitoring_interrupts
11617 ± 21% -41.5% 6801 interrupts.CPU5.NMI:Non-maskable_interrupts
11617 ± 21% -41.5% 6801 interrupts.CPU5.PMI:Performance_monitoring_interrupts
320147 ± 22% -26.9% 233880 ± 2% sched_debug.cfs_rq:/.load.max
18580 ± 24% -32.6% 12525 ± 28% sched_debug.cpu.nr_switches.stddev
18333 ± 27% -34.9% 11930 ± 26% sched_debug.cpu.sched_count.stddev
8807 ± 28% -31.8% 6009 ± 19% sched_debug.cpu.ttwu_count.stddev
8775 ± 26% -31.9% 5973 ± 22% sched_debug.cpu.ttwu_local.stddev
5412715 -2.1% 5298836 proc-vmstat.numa_hit
5412715 -2.1% 5298836 proc-vmstat.numa_local
1.291e+09 -2.1% 1.265e+09 proc-vmstat.pgalloc_normal
2908824 -1.7% 2858835 proc-vmstat.pgfault
1.291e+09 -2.1% 1.265e+09 proc-vmstat.pgfree
2516340 -2.1% 2464229 proc-vmstat.thp_fault_alloc
86.60 -27.4 59.23 perf-profile.calltrace.cycles-pp.clear_page_erms.clear_subpage.clear_huge_page.do_huge_pmd_anonymous_page.__handle_mm_fault
95.16 -0.6 94.56 perf-profile.calltrace.cycles-pp.page_fault
94.97 -0.6 94.38 perf-profile.calltrace.cycles-pp.handle_mm_fault.do_page_fault.page_fault
95.12 -0.6 94.53 perf-profile.calltrace.cycles-pp.do_page_fault.page_fault
94.88 -0.6 94.29 perf-profile.calltrace.cycles-pp.do_huge_pmd_anonymous_page.__handle_mm_fault.handle_mm_fault.do_page_fault.page_fault
94.95 -0.6 94.36 perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_page_fault.page_fault
90.91 -0.4 90.56 perf-profile.calltrace.cycles-pp.clear_huge_page.do_huge_pmd_anonymous_page.__handle_mm_fault.handle_mm_fault.do_page_fault
3.37 -0.2 3.18 perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.do_huge_pmd_anonymous_page.__handle_mm_fault
3.38 -0.2 3.19 perf-profile.calltrace.cycles-pp.__alloc_pages_nodemask.alloc_pages_vma.do_huge_pmd_anonymous_page.__handle_mm_fault.handle_mm_fault
3.38 -0.2 3.20 perf-profile.calltrace.cycles-pp.alloc_pages_vma.do_huge_pmd_anonymous_page.__handle_mm_fault.handle_mm_fault.do_page_fault
0.87 ± 2% -0.1 0.79 ± 4% perf-profile.calltrace.cycles-pp.rcu_all_qs._cond_resched.clear_huge_page.do_huge_pmd_anonymous_page.__handle_mm_fault
2.49 ± 4% +0.5 3.03 ± 4% perf-profile.calltrace.cycles-pp.munmap
2.48 ± 4% +0.5 3.02 ± 4% perf-profile.calltrace.cycles-pp.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.munmap
2.48 ± 4% +0.5 3.02 ± 4% perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.munmap
2.48 ± 4% +0.5 3.03 ± 4% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.munmap
2.48 ± 4% +0.5 3.03 ± 4% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.munmap
2.44 ± 4% +0.5 2.99 ± 4% perf-profile.calltrace.cycles-pp.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
2.42 ± 4% +0.6 2.97 ± 4% perf-profile.calltrace.cycles-pp.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64
2.37 ± 4% +0.6 2.93 ± 4% perf-profile.calltrace.cycles-pp.tlb_finish_mmu.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap
2.37 ± 4% +0.6 2.93 ± 4% perf-profile.calltrace.cycles-pp.tlb_flush_mmu.tlb_finish_mmu.unmap_region.__do_munmap.__vm_munmap
2.32 ± 4% +0.6 2.88 ± 4% perf-profile.calltrace.cycles-pp.release_pages.tlb_flush_mmu.tlb_finish_mmu.unmap_region.__do_munmap
2.19 ± 4% +0.6 2.77 ± 4% perf-profile.calltrace.cycles-pp.__free_pages_ok.release_pages.tlb_flush_mmu.tlb_finish_mmu.unmap_region
0.00 +2.0 2.04 ± 2% perf-profile.calltrace.cycles-pp.clear_zero_page_flag.__free_pages_ok.release_pages.tlb_flush_mmu.tlb_finish_mmu
87.03 -27.5 59.52 perf-profile.children.cycles-pp.clear_page_erms
95.19 -0.6 94.59 perf-profile.children.cycles-pp.page_fault
95.14 -0.6 94.55 perf-profile.children.cycles-pp.do_page_fault
94.97 -0.6 94.38 perf-profile.children.cycles-pp.__handle_mm_fault
94.88 -0.6 94.29 perf-profile.children.cycles-pp.do_huge_pmd_anonymous_page
94.99 -0.6 94.41 perf-profile.children.cycles-pp.handle_mm_fault
90.97 -0.3 90.63 perf-profile.children.cycles-pp.clear_huge_page
88.69 -0.3 88.41 perf-profile.children.cycles-pp.clear_subpage
3.60 -0.2 3.40 perf-profile.children.cycles-pp.__alloc_pages_nodemask
3.58 -0.2 3.38 perf-profile.children.cycles-pp.get_page_from_freelist
3.40 -0.2 3.21 perf-profile.children.cycles-pp.alloc_pages_vma
0.71 ± 14% -0.2 0.53 ± 5% perf-profile.children.cycles-pp.apic_timer_interrupt
0.97 ± 3% -0.1 0.87 ± 3% perf-profile.children.cycles-pp._cond_resched
0.89 ± 2% -0.1 0.80 ± 3% perf-profile.children.cycles-pp.rcu_all_qs
0.65 ± 2% -0.0 0.60 ± 2% perf-profile.children.cycles-pp.prep_new_page
0.25 ± 9% -0.0 0.21 ± 7% perf-profile.children.cycles-pp.__hrtimer_run_queues
0.50 -0.0 0.47 ± 2% perf-profile.children.cycles-pp.prep_compound_page
2.49 ± 4% +0.5 3.03 ± 4% perf-profile.children.cycles-pp.munmap
2.48 ± 4% +0.5 3.03 ± 4% perf-profile.children.cycles-pp.__vm_munmap
2.48 ± 4% +0.5 3.03 ± 4% perf-profile.children.cycles-pp.__x64_sys_munmap
2.38 ± 4% +0.5 2.93 ± 4% perf-profile.children.cycles-pp.tlb_flush_mmu
2.45 ± 4% +0.6 3.00 ± 4% perf-profile.children.cycles-pp.__do_munmap
2.42 ± 4% +0.6 2.98 ± 4% perf-profile.children.cycles-pp.unmap_region
2.38 ± 4% +0.6 2.94 ± 4% perf-profile.children.cycles-pp.tlb_finish_mmu
2.33 ± 4% +0.6 2.90 ± 4% perf-profile.children.cycles-pp.release_pages
2.78 ± 3% +0.6 3.35 ± 3% perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
2.78 ± 3% +0.6 3.35 ± 3% perf-profile.children.cycles-pp.do_syscall_64
2.20 ± 4% +0.6 2.77 ± 3% perf-profile.children.cycles-pp.__free_pages_ok
0.00 +2.0 2.04 ± 3% perf-profile.children.cycles-pp.clear_zero_page_flag
86.48 -27.2 59.24 perf-profile.self.cycles-pp.clear_page_erms
2.09 ± 4% -1.5 0.63 ± 7% perf-profile.self.cycles-pp.__free_pages_ok
0.58 ± 2% -0.1 0.43 ± 7% perf-profile.self.cycles-pp.rcu_all_qs
0.50 ± 2% -0.0 0.47 ± 3% perf-profile.self.cycles-pp.prep_compound_page
0.37 ± 5% +0.0 0.41 ± 2% perf-profile.self.cycles-pp._cond_resched
0.00 +2.0 2.03 ± 3% perf-profile.self.cycles-pp.clear_zero_page_flag
1.87 ± 2% +27.0 28.87 perf-profile.self.cycles-pp.clear_subpage



will-it-scale.per_thread_ops

538000 +------------------------------------------------------------------+
| ++ + .++++.++++ ++ + ++ ++: +|
536000 |-+ ++ : +++.++ +.+ |
534000 |-+ +.++++.+ + |
| |
532000 |-+ |
| |
530000 |-+ |
| |
528000 |-+ |
526000 |-+ |
| O OO O OO |
524000 |O+ OO OOO OO OOOO OOOO |
| |
522000 +------------------------------------------------------------------+


will-it-scale.workload

4.32e+06 +----------------------------------------------------------------+
| |
4.3e+06 |++.+++++.+ ++. +++. ++++.++++.+++++.+++++.++ |
| : : ++ + : .++ +.++|
4.28e+06 |-+ ++ :++.+++++ ++ |
| + |
4.26e+06 |-+ |
| |
4.24e+06 |-+ |
| |
4.22e+06 |-+ |
| |
4.2e+06 |-O OOOOO OOOOO OOOO OOO |
|O O O O |
4.18e+06 +----------------------------------------------------------------+


will-it-scale.time.minor_page_faults

2.54e+06 +----------------------------------------------------------------+
|+ .+ + +. + +. |
2.53e+06 |-+ ++++.+ ++.+++++. + + ++++.++ + +++++.++ |
2.52e+06 |-+ : : + : ++.++ +.++|
| ++ +++.+++ ++ |
2.51e+06 |-+ |
| |
2.5e+06 |-+ |
| |
2.49e+06 |-+ |
2.48e+06 |-+ |
| |
2.47e+06 |-O OOOOO OOOOO OOOO OOO |
|O O O O |
2.46e+06 +----------------------------------------------------------------+


[*] bisect-good sample
[O] bisect-bad sample



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Rong Chen


Attachments:
(No filename) (16.47 kB)
config-5.6.0-12710-g5ae8a9d7c84e7 (209.50 kB)
job-script (7.46 kB)
job.yaml (5.15 kB)
reproduce (323.00 B)

2020-04-22 14:14:07

by Vlastimil Babka

Subject: Re: [RFC PATCH 4/4] mm: Add PG_zero support

On 4/13/20 11:05 PM, Andrew Morton wrote:
> On Mon, 13 Apr 2020 08:11:59 -0700 Alexander Duyck <[email protected]> wrote:
>
>> In addition, unlike madvising the page away there is a pretty
>> significant performance penalty for having to clear the page a second
>> time when the page is split or merged.
>
> I wonder if there might be an issue with increased memory traffic (and
> increased energy consumption, etc). If a page is zeroed immediately
> before getting data written into it (eg, plain old file write(),
> anonymous pagefault) then we can expect that those 4096 zeroes will be
> in CPU cache and mostly not written back. But if that page was zeroed
> a "long" time ago, the caches will probably have been written back.
> Net result: we go from 4k of memory traffic for a 4k page up to 8k of
> memory traffic?

Heh, I was quite sure that this is not the first time background zeroing is
proposed, so I went to google for it... and found that one BSD kernel actually
removed this functionality in 2016 [1] and this was one of the reasons.

[1]
https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/afd2da4dc9056ea79cdf15e8a9386a3d3998f33e

> Also, the name CONFIG_ZERO_PAGE sounds like it has something to do with
> the long established "zero page". Confusing. CONFIG_PREZERO_PAGE,
> maybe?
>

2020-04-24 00:38:57

by Andrew Morton

Subject: Re: [RFC PATCH 4/4] mm: Add PG_zero support

On Wed, 22 Apr 2020 16:09:00 +0200 Vlastimil Babka <[email protected]> wrote:

> On 4/13/20 11:05 PM, Andrew Morton wrote:
> > On Mon, 13 Apr 2020 08:11:59 -0700 Alexander Duyck <[email protected]> wrote:
> >
> >> In addition, unlike madvising the page away there is a pretty
> >> significant performance penalty for having to clear the page a second
> >> time when the page is split or merged.
> >
> > I wonder if there might be an issue with increased memory traffic (and
> > increased energy consumption, etc). If a page is zeroed immediately
> > before getting data written into it (eg, plain old file write(),
> > anonymous pagefault) then we can expect that those 4096 zeroes will be
> > in CPU cache and mostly not written back. But if that page was zeroed
> > a "long" time ago, the caches will probably have been written back.
> > Net result: we go from 4k of memory traffic for a 4k page up to 8k of
> > memory traffic?
>
> Heh, I was quite sure that this is not the first time background zeroing is
> proposed, so I went to google for it... and found that one BSD kernel actually
> removed this functionality in 2016 [1] and this was one of the reasons.
>
> [1]
> https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/afd2da4dc9056ea79cdf15e8a9386a3d3998f33e

Interesting.

However this:

- Pre-zeroing a page leads to a cold-cache case on-use, forcing the fault
source (e.g. a userland program) to actually get the data from main
memory in its likely immediate use of the faulted page, reducing
performance.

implies that BSD was zeroing with non-temporal stores which bypass the
CPU cache. And which presumably invalidate any part of the target
memory which was already in cache. We wouldn't do it that way so
perhaps the results would differ.
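
For illustration, zeroing a 4 KiB page with non-temporal (streaming)
stores on x86 looks roughly like this in userspace (assumes a
16-byte-aligned page and SSE2; illustrative only, not kernel code):

#include <emmintrin.h>	/* SSE2 streaming-store intrinsics */
#include <stddef.h>

/* The zeroes go straight to memory and are not left hot in the cache
 * hierarchy, which is the "cold-cache on use" effect described above. */
static void clear_page_nontemporal(void *page)
{
	const __m128i zero = _mm_setzero_si128();
	char *p = page;
	size_t off;

	for (off = 0; off < 4096; off += 16)
		_mm_stream_si128((__m128i *)(p + off), zero);

	_mm_sfence();	/* order streaming stores before later accesses */
}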


2020-04-24 00:43:58

by Matthew Wilcox

Subject: Re: [RFC PATCH 4/4] mm: Add PG_zero support

On Thu, Apr 23, 2020 at 05:37:00PM -0700, Andrew Morton wrote:
> On Wed, 22 Apr 2020 16:09:00 +0200 Vlastimil Babka <[email protected]> wrote:
> > Heh, I was quite sure that this is not the first time background zeroing is
> > proposed, so I went to google for it... and found that one BSD kernel actually
> > removed this functionality in 2016 [1] and this was one of the reasons.
> >
> > [1]
> > https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/afd2da4dc9056ea79cdf15e8a9386a3d3998f33e
>
> Interesting.
>
> However this:
>
> - Pre-zeroing a page leads to a cold-cache case on-use, forcing the fault
> source (e.g. a userland program) to actually get the data from main
> memory in its likely immediate use of the faulted page, reducing
> performance.
>
> implies that BSD was zeroing with non-temporal stores which bypass the
> CPU cache. And which presumably invalidate any part of the target
> memory which was already in cache. We wouldn't do it that way so
> perhaps the results would differ.

Or just that the page was zeroed far enough in advance that it fell out
of cache naturally.

I know Arjan looked at zeroing on free instead of zeroing on alloc,
and that didn't get merged (or even submitted afaik), so presumably the
results weren't good.

When I was at Microsoft, there was a usecase that made sense, and that
was virtualisation. If the hypervisor has zeroed the page before giving
it to the guest, then there's no need for the guest to zero it again.
It's already cache hot, and can be given straight to userspace.

2020-04-24 07:33:50

by David Hildenbrand

Subject: Re: [RFC PATCH 4/4] mm: Add PG_zero support

On 24.04.20 02:41, Matthew Wilcox wrote:
> On Thu, Apr 23, 2020 at 05:37:00PM -0700, Andrew Morton wrote:
>> On Wed, 22 Apr 2020 16:09:00 +0200 Vlastimil Babka <[email protected]> wrote:
>>> Heh, I was quite sure that this is not the first time background zeroing is
>>> proposed, so I went to google for it... and found that one BSD kernel actually
>>> removed this functionality in 2016 [1] and this was one of the reasons.
>>>
>>> [1]
>>> https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/afd2da4dc9056ea79cdf15e8a9386a3d3998f33e
>>
>> Interesting.
>>
>> However this:
>>
>> - Pre-zeroing a page leads to a cold-cache case on-use, forcing the fault
>> source (e.g. a userland program) to actually get the data from main
>> memory in its likely immediate use of the faulted page, reducing
>> performance.
>>
>> implies that BSD was zeroing with non-temporal stores which bypass the
>> CPU cache. And which presumably invalidate any part of the target
>> memory which was already in cache. We wouldn't do it that way so
>> perhaps the results would differ.
>
> Or just that the page was zeroed far enough in advance that it fell out
> of cache naturally.
>
> I know Arjan looked at zeroing on free instead of zeroing on alloc,
> and that didn't get merged (or even submitted afaik), so presumably the
> results weren't good.

We do have INIT_ON_FREE_DEFAULT_ON

via

commit 6471384af2a6530696fc0203bafe4de41a23c9ef
Author: Alexander Potapenko <[email protected]>
Date: Thu Jul 11 20:59:19 2019 -0700

mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options

which seems to do exactly that (although for a different use case)
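
For reference, the free-path hook from that commit reduces to roughly
the following (paraphrased from mm/page_alloc.c of that time, with
details such as the KASAN interaction omitted):

static void kernel_init_free_pages(struct page *page, int numpages)
{
	int i;

	/* init_on_free: zero synchronously while freeing to the buddy. */
	for (i = 0; i < numpages; i++)
		clear_highpage(page + i);
}

/* ...called from free_pages_prepare(): */
	if (want_init_on_free())
		kernel_init_free_pages(page, 1 << order);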

--
Thanks,

David / dhildenb

2020-04-24 07:59:35

by Vlastimil Babka

Subject: Re: [RFC PATCH 4/4] mm: Add PG_zero support

On 4/24/20 9:28 AM, David Hildenbrand wrote:
> On 24.04.20 02:41, Matthew Wilcox wrote:
>> On Thu, Apr 23, 2020 at 05:37:00PM -0700, Andrew Morton wrote:
>>> On Wed, 22 Apr 2020 16:09:00 +0200 Vlastimil Babka <[email protected]> wrote:
>>>> Heh, I was quite sure that this is not the first time background zeroing is
>>>> proposed, so I went to google for it... and found that one BSD kernel actually
>>>> removed this functionality in 2016 [1] and this was one of the reasons.
>>>>
>>>> [1]
>>>> https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/afd2da4dc9056ea79cdf15e8a9386a3d3998f33e
>>>
>>> Interesting.
>>>
>>> However this:
>>>
>>> - Pre-zeroing a page leads to a cold-cache case on-use, forcing the fault
>>> source (e.g. a userland program) to actually get the data from main
>>> memory in its likely immediate use of the faulted page, reducing
>>> performance.
>>>
>>> implies that BSD was zeroing with non-temporal stores which bypass the
>>> CPU cache. And which presumably invalidate any part of the target
>>> memory which was already in cache. We wouldn't do it that way so
>>> perhaps the results would differ.
>>
>> Or just that the page was zeroed far enough in advance that it fell out
>> of cache naturally.

Agreed.

>> I know Arjan looked at zeroing on free instead of zeroing on alloc,
>> and that didn't get merged (or even submitted afaik), so presumably the
>> results weren't good.
>
> We do have INIT_ON_FREE_DEFAULT_ON
>
> via
>
> commit 6471384af2a6530696fc0203bafe4de41a23c9ef
> Author: Alexander Potapenko <[email protected]>
> Date: Thu Jul 11 20:59:19 2019 -0700
>
> mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
>
> which seems to do exactly that (although for a different use case)

Yeah, except the security use case wants to do that immediately, while the
proposal here is a background thread.

2020-04-24 08:00:12

by David Hildenbrand

Subject: Re: [RFC PATCH 4/4] mm: Add PG_zero support

On 24.04.20 09:55, Vlastimil Babka wrote:
> On 4/24/20 9:28 AM, David Hildenbrand wrote:
>> On 24.04.20 02:41, Matthew Wilcox wrote:
>>> On Thu, Apr 23, 2020 at 05:37:00PM -0700, Andrew Morton wrote:
>>>> On Wed, 22 Apr 2020 16:09:00 +0200 Vlastimil Babka <[email protected]> wrote:
>>>>> Heh, I was quite sure that this is not the first time background zeroing is
>>>>> proposed, so I went to google for it... and found that one BSD kernel actually
>>>>> removed this functionality in 2016 [1] and this was one of the reasons.
>>>>>
>>>>> [1]
>>>>> https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/afd2da4dc9056ea79cdf15e8a9386a3d3998f33e
>>>>
>>>> Interesting.
>>>>
>>>> However this:
>>>>
>>>> - Pre-zeroing a page leads to a cold-cache case on-use, forcing the fault
>>>> source (e.g. a userland program) to actually get the data from main
>>>> memory in its likely immediate use of the faulted page, reducing
>>>> performance.
>>>>
>>>> implies that BSD was zeroing with non-temporal stores which bypass the
>>>> CPU cache. And which presumably invalidate any part of the target
>>>> memory which was already in cache. We wouldn't do it that way so
>>>> perhaps the results would differ.
>>>
>>> Or just that the page was zeroed far enough in advance that it fell out
>>> of cache naturally.
>
> Agreed.
>
>>> I know Arjan looked at zeroing on free instead of zeroing on alloc,
>>> and that didn't get merged (or even submitted afaik), so presumably the
>>> results weren't good.
>>
>> We do have INIT_ON_FREE_DEFAULT_ON
>>
>> via
>>
>> commit 6471384af2a6530696fc0203bafe4de41a23c9ef
>> Author: Alexander Potapenko <[email protected]>
>> Date: Thu Jul 11 20:59:19 2019 -0700
>>
>> mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
>>
>> which seems to do exactly that (although for a different use case)
>
> Yeah, except the security use case wants to do that immediately, while the
> proposal here is a background thread.
>

Yes I know, this was just a comment regarding "Arjan looked at zeroing
on free instead of zeroing on alloc".

--
Thanks,

David / dhildenb