2021-01-28 00:43:23

by Saravanan D

Subject: [PATCH V3] x86/mm: Tracking linear mapping split events

To help debug the sluggishness caused by TLB misses/reloads, we
introduce monotonic lifetime hugepage split event counters that are
accumulated once the system state reaches SYSTEM_RUNNING and are
displayed as part of /proc/vmstat on x86 servers.

The lifetime split event information will be displayed at the bottom of
/proc/vmstat
....
swap_ra 0
swap_ra_hit 0
direct_map_2M_splits 167
direct_map_1G_splits 6
nr_unstable 0
....

One of the many lasting sources of huge page splits (lasting because we
do not coalesce the mappings back) is tracing: the granular page
attribute/permission changes it makes force the kernel to split code
segments mapped with huge pages into smaller ones, thereby increasing
the probability of TLB misses/reloads even after tracing has been stopped.
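
For illustration only (not part of this patch), a minimal userspace
sketch that dumps the new counters, assuming the direct_map_* names
shown in the sample /proc/vmstat output above:

#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[128];
        FILE *fp = fopen("/proc/vmstat", "r");

        if (!fp) {
                perror("/proc/vmstat");
                return 1;
        }

        /* Print every counter whose name starts with "direct_map_". */
        while (fgets(line, sizeof(line), fp)) {
                if (!strncmp(line, "direct_map_", strlen("direct_map_")))
                        fputs(line, stdout);
        }

        fclose(fp);
        return 0;
}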

Signed-off-by: Saravanan D <[email protected]>
---
arch/x86/mm/pat/set_memory.c | 18 ++++++++++++++++++
include/linux/vm_event_item.h | 8 ++++++++
mm/vmstat.c | 8 ++++++++
3 files changed, 34 insertions(+)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..3ea6316df089 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -16,6 +16,8 @@
#include <linux/pci.h>
#include <linux/vmalloc.h>
#include <linux/libnvdimm.h>
+#include <linux/vmstat.h>
+#include <linux/kernel.h>

#include <asm/e820/api.h>
#include <asm/processor.h>
@@ -85,12 +87,28 @@ void update_page_count(int level, unsigned long pages)
spin_unlock(&pgd_lock);
}

+void update_split_page_event_count(int level)
+{
+ if (system_state == SYSTEM_RUNNING) {
+ if (level == PG_LEVEL_2M) {
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+ count_vm_event(DIRECT_MAP_2M_SPLIT);
+#else
+ count_vm_event(DIRECT_MAP_4M_SPLIT);
+#endif
+ } else if (level == PG_LEVEL_1G) {
+ count_vm_event(DIRECT_MAP_1G_SPLIT);
+ }
+ }
+}
+
static void split_page_count(int level)
{
if (direct_pages_count[level] == 0)
return;

direct_pages_count[level]--;
+ update_split_page_event_count(level);
direct_pages_count[level - 1] += PTRS_PER_PTE;
}

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 18e75974d4e3..439742d2435e 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -120,6 +120,14 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#ifdef CONFIG_SWAP
SWAP_RA,
SWAP_RA_HIT,
+#endif
+#if defined(__x86_64__)
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+ DIRECT_MAP_2M_SPLIT,
+#else
+ DIRECT_MAP_4M_SPLIT,
+#endif
+ DIRECT_MAP_1G_SPLIT,
#endif
NR_VM_EVENT_ITEMS
};
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f8942160fc95..beaa2bb4f9dc 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1350,6 +1350,14 @@ const char * const vmstat_text[] = {
"swap_ra",
"swap_ra_hit",
#endif
+#if defined(__x86_64__)
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+ "direct_map_2M_splits",
+#else
+ "direct_map_4M_splits",
+#endif
+ "direct_map_1G_splits",
+#endif
#endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
};
#endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
--
2.24.1


2021-01-28 00:46:17

by Randy Dunlap

Subject: Re: [PATCH V3] x86/mm: Tracking linear mapping split events

On 1/27/21 2:50 PM, Saravanan D wrote:
> To help debug the sluggishness caused by TLB misses/reloads, we
> introduce monotonic lifetime hugepage split event counters that are
> accumulated once the system state reaches SYSTEM_RUNNING and are
> displayed as part of /proc/vmstat on x86 servers.
>
> The lifetime split event information will be displayed at the bottom of
> /proc/vmstat
> ....
> swap_ra 0
> swap_ra_hit 0
> direct_map_2M_splits 167
> direct_map_1G_splits 6
> nr_unstable 0
> ....
>
> One of the many lasting sources of huge page splits (lasting because we
> do not coalesce the mappings back) is tracing: the granular page
> attribute/permission changes it makes force the kernel to split code
> segments mapped with huge pages into smaller ones, thereby increasing
> the probability of TLB misses/reloads even after tracing has been stopped.
>
> Signed-off-by: Saravanan D <[email protected]>
> ---
> arch/x86/mm/pat/set_memory.c | 18 ++++++++++++++++++
> include/linux/vm_event_item.h | 8 ++++++++
> mm/vmstat.c | 8 ++++++++
> 3 files changed, 34 insertions(+)

Documentation/ update, please.

--
~Randy

2021-01-28 00:48:57

by Dave Hansen

Subject: Re: [PATCH V3] x86/mm: Tracking linear mapping split events

On 1/27/21 2:50 PM, Saravanan D wrote:
> +#if defined(__x86_64__)

We don't use __x86_64__ in the kernel. This should be CONFIG_X86.

> +#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
> + "direct_map_2M_splits",
> +#else
> + "direct_map_4M_splits",
> +#endif
> + "direct_map_1G_splits",
> +#endif

These #ifdefs are hideous, and repeated.

I'd rather have no 32-bit support than expose us to this ugliness.
Worst case, the 32-bit non-PAE folks (i.e. almost nobody in the world)
can just live with seeing "2M" when the mappings are really 4M. Or, you
*could* name these after the page table levels:

direct_map_pmd_splits
direct_map_pud_splits

or the level from the bottom where the split occurred:

direct_map_level2_splits
direct_map_level3_splits

That has the bonus of being usable on other architectures.

Oh, and 1G splits aren't possible on non-PAE 32-bit. There are only 2
levels: 4M and 4k, which would make what you have above:

> +#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
> + "direct_map_2M_splits",
> + "direct_map_1G_splits",
> +#else
> + "direct_map_4M_splits",
> +#endif

I don't think there's ever a 1G/4M case.

2021-01-28 00:50:04

by Saravanan D

Subject: Re: [PATCH V3] x86/mm: Tracking linear mapping split events

Hi Randy,
> Documentation/ update, please.
I will include it in the V4 patch.

- Saravanan D

2021-01-28 00:52:55

by Saravanan D

Subject: Re: [PATCH V3] x86/mm: Tracking linear mapping split events

Hi Dave,

> We don't use __x86_64__ in the kernel. This should be CONFIG_X86.
Noted. I will correct this in V4

> or the level from the bottom where the split occurred:
>
> direct_map_level2_splits
> direct_map_level3_splits
>
> That has the bonus of being usable on other architectures.
Naming them after page table levels makes a lot of sense: two new
vmstat event counters that are relevant for all architectures, without
the need for #ifdef page size craziness.

- Saravanan D

2021-01-28 04:37:49

by Saravanan D

Subject: [PATCH V4] x86/mm: Tracking linear mapping split events

To help debug the sluggishness caused by TLB misses/reloads, we
introduce monotonic lifetime hugepage split event counters that are
accumulated once the system state reaches SYSTEM_RUNNING and are
displayed as part of /proc/vmstat on x86 servers.

The lifetime split event information will be displayed at the bottom of
/proc/vmstat
....
swap_ra 0
swap_ra_hit 0
direct_map_level2_splits 94
direct_map_level3_splits 4
nr_unstable 0
....

One of the many lasting sources of huge page splits (lasting because we
do not coalesce the mappings back) is tracing: the granular page
attribute/permission changes it makes force the kernel to split code
segments mapped with huge pages into smaller ones, thereby increasing
the probability of TLB misses/reloads even after tracing has been stopped.

Documentation regarding linear mapping split events added to admin-guide
as requested in V3 of the patch.
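
For illustration only (not part of this patch), a minimal userspace
sketch of how the monotonic counters can be used for debugging: sample
the direct_map_level{2,3}_splits values shown above before and after a
workload (e.g. a tracing session) and report the delta:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the value of a /proc/vmstat counter, or 0 if not found. */
static unsigned long read_counter(const char *name)
{
        char line[128];
        unsigned long val = 0;
        FILE *fp = fopen("/proc/vmstat", "r");

        if (!fp)
                return 0;

        while (fgets(line, sizeof(line), fp)) {
                if (!strncmp(line, name, strlen(name))) {
                        val = strtoul(line + strlen(name), NULL, 10);
                        break;
                }
        }
        fclose(fp);
        return val;
}

int main(int argc, char **argv)
{
        unsigned long l2 = read_counter("direct_map_level2_splits");
        unsigned long l3 = read_counter("direct_map_level3_splits");

        /* Run the workload under test, passed as a shell command. */
        if (argc > 1 && system(argv[1]) == -1)
                perror("system");

        printf("direct_map_level2_splits: +%lu\n",
               read_counter("direct_map_level2_splits") - l2);
        printf("direct_map_level3_splits: +%lu\n",
               read_counter("direct_map_level3_splits") - l3);
        return 0;
}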

Signed-off-by: Saravanan D <[email protected]>
---
.../admin-guide/mm/direct_mapping_splits.rst | 59 +++++++++++++++++++
Documentation/admin-guide/mm/index.rst | 1 +
arch/x86/mm/pat/set_memory.c | 13 ++++
include/linux/vm_event_item.h | 4 ++
mm/vmstat.c | 4 ++
5 files changed, 81 insertions(+)
create mode 100644 Documentation/admin-guide/mm/direct_mapping_splits.rst

diff --git a/Documentation/admin-guide/mm/direct_mapping_splits.rst b/Documentation/admin-guide/mm/direct_mapping_splits.rst
new file mode 100644
index 000000000000..298751391deb
--- /dev/null
+++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst
@@ -0,0 +1,59 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Direct Mapping Splits
+=====================
+
+The kernel maps all of physical memory in linear/direct mapped pages,
+where translation of a kernel virtual address to its physical address
+is a simple offset subtraction. CPUs cache these translations in fast
+caches called TLBs. CPU architectures like x86 allow large portions of
+memory to be direct mapped with hugepages (2M, 1G, etc.) at various
+page table levels.
+
+Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
+The splintering of huge direct pages into smaller ones results in a
+measurable performance hit caused by frequent TLB misses and reloads.
+
+One of the many lasting sources of huge page splits (lasting because we
+do not coalesce the mappings back) is tracing: the granular page
+attribute/permission changes it makes force the kernel to split code
+segments mapped with hugepages into smaller ones, increasing the
+probability of TLB misses/reloads even after tracing has been stopped.
+
+On x86 systems, we can track the splitting of huge direct mapped pages
+through lifetime event counters in ``/proc/vmstat``
+
+ direct_map_level2_splits xxx
+ direct_map_level3_splits yyy
+
+where:
+
+direct_map_level2_splits
+ are 2M/4M hugepage split events
+direct_map_level3_splits
+ are 1G hugepage split events
+
+The distribution of direct mapped system memory across the supported
+page sizes, after any splits, can be viewed through ``/proc/meminfo``,
+whose output will include the following lines depending on the CPU
+architecture:
+
+ DirectMap4k: xxxxx kB
+ DirectMap2M: yyyyy kB
+ DirectMap1G: zzzzz kB
+
+where:
+
+DirectMap4k
+ is the total amount of direct mapped memory (in kB)
+ accessed through 4k pages
+DirectMap2M
+ is the total amount of direct mapped memory (in kB)
+ accessed through 2M pages
+DirectMap1G
+ is the total amount of direct mapped memory (in kB)
+ accessed through 1G pages
+
+
+-- Saravanan D, Jan 27, 2021
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index 4b14d8b50e9e..9439780f3f07 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -38,3 +38,4 @@ the Linux memory management.
soft-dirty
transhuge
userfaultfd
+ direct_mapping_splits
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..767cade53bdc 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -16,6 +16,8 @@
#include <linux/pci.h>
#include <linux/vmalloc.h>
#include <linux/libnvdimm.h>
+#include <linux/vmstat.h>
+#include <linux/kernel.h>

#include <asm/e820/api.h>
#include <asm/processor.h>
@@ -85,12 +87,23 @@ void update_page_count(int level, unsigned long pages)
spin_unlock(&pgd_lock);
}

+void update_split_page_event_count(int level)
+{
+ if (system_state == SYSTEM_RUNNING) {
+ if (level == PG_LEVEL_2M)
+ count_vm_event(DIRECT_MAP_LEVEL2_SPLIT);
+ else if (level == PG_LEVEL_1G)
+ count_vm_event(DIRECT_MAP_LEVEL3_SPLIT);
+ }
+}
+
static void split_page_count(int level)
{
if (direct_pages_count[level] == 0)
return;

direct_pages_count[level]--;
+ update_split_page_event_count(level);
direct_pages_count[level - 1] += PTRS_PER_PTE;
}

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 18e75974d4e3..7c06c2bdc33b 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#ifdef CONFIG_SWAP
SWAP_RA,
SWAP_RA_HIT,
+#endif
+#ifdef CONFIG_X86
+ DIRECT_MAP_LEVEL2_SPLIT,
+ DIRECT_MAP_LEVEL3_SPLIT,
#endif
NR_VM_EVENT_ITEMS
};
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f8942160fc95..a43ac4ac98a2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1350,6 +1350,10 @@ const char * const vmstat_text[] = {
"swap_ra",
"swap_ra_hit",
#endif
+#ifdef CONFIG_X86
+ "direct_map_level2_splits",
+ "direct_map_level3_splits",
+#endif
#endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
};
#endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
--
2.24.1

2021-01-28 04:55:01

by Matthew Wilcox

Subject: Re: [PATCH V4] x86/mm: Tracking linear mapping split events

You forgot to cc linux-mm. Adding. Also I think you should be cc'ing
Song.

On Wed, Jan 27, 2021 at 08:35:47PM -0800, Saravanan D wrote:
> To help debug the sluggishness caused by TLB misses/reloads, we
> introduce monotonic lifetime hugepage split event counters that are
> accumulated once the system state reaches SYSTEM_RUNNING and are
> displayed as part of /proc/vmstat on x86 servers.
>
> The lifetime split event information will be displayed at the bottom of
> /proc/vmstat
> ....
> swap_ra 0
> swap_ra_hit 0
> direct_map_level2_splits 94
> direct_map_level3_splits 4
> nr_unstable 0
> ....
>
> One of the many lasting sources of huge page splits (lasting because we
> do not coalesce the mappings back) is tracing: the granular page
> attribute/permission changes it makes force the kernel to split code
> segments mapped with huge pages into smaller ones, thereby increasing
> the probability of TLB misses/reloads even after tracing has been stopped.

Are you talking about kernel text here or application text?

In either case, I don't know why you're saying we don't coalesce
back after tracing is disabled. I was under the impression we did
(either actively in the case of the kernel or via khugepaged for
user text).

> Documentation regarding linear mapping split events added to admin-guide
> as requested in V3 of the patch.
>
> Signed-off-by: Saravanan D <[email protected]>
> ---
> .../admin-guide/mm/direct_mapping_splits.rst | 59 +++++++++++++++++++
> Documentation/admin-guide/mm/index.rst | 1 +
> arch/x86/mm/pat/set_memory.c | 13 ++++
> include/linux/vm_event_item.h | 4 ++
> mm/vmstat.c | 4 ++
> 5 files changed, 81 insertions(+)
> create mode 100644 Documentation/admin-guide/mm/direct_mapping_splits.rst
>
> diff --git a/Documentation/admin-guide/mm/direct_mapping_splits.rst b/Documentation/admin-guide/mm/direct_mapping_splits.rst
> new file mode 100644
> index 000000000000..298751391deb
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst
> @@ -0,0 +1,59 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Direct Mapping Splits
> +=====================
> +
> +The kernel maps all of physical memory in linear/direct mapped pages,
> +where translation of a kernel virtual address to its physical address
> +is a simple offset subtraction. CPUs cache these translations in fast
> +caches called TLBs. CPU architectures like x86 allow large portions of
> +memory to be direct mapped with hugepages (2M, 1G, etc.) at various
> +page table levels.
> +
> +Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
> +The splintering of huge direct pages into smaller ones results in a
> +measurable performance hit caused by frequent TLB misses and reloads.
> +
> +One of the many lasting sources of huge page splits (lasting because we
> +do not coalesce the mappings back) is tracing: the granular page
> +attribute/permission changes it makes force the kernel to split code
> +segments mapped with hugepages into smaller ones, increasing the
> +probability of TLB misses/reloads even after tracing has been stopped.
> +
> +On x86 systems, we can track the splitting of huge direct mapped pages
> +through lifetime event counters in ``/proc/vmstat``
> +
> + direct_map_level2_splits xxx
> + direct_map_level3_splits yyy
> +
> +where:
> +
> +direct_map_level2_splits
> + are 2M/4M hugepage split events
> +direct_map_level3_splits
> + are 1G hugepage split events
> +
> +The distribution of direct mapped system memory across the supported
> +page sizes, after any splits, can be viewed through ``/proc/meminfo``,
> +whose output will include the following lines depending on the CPU
> +architecture:
> +
> + DirectMap4k: xxxxx kB
> + DirectMap2M: yyyyy kB
> + DirectMap1G: zzzzz kB
> +
> +where:
> +
> +DirectMap4k
> + is the total amount of direct mapped memory (in kB)
> + accessed through 4k pages
> +DirectMap2M
> + is the total amount of direct mapped memory (in kB)
> + accessed through 2M pages
> +DirectMap1G
> + is the total amount of direct mapped memory (in kB)
> + accessed through 1G pages
> +
> +
> +-- Saravanan D, Jan 27, 2021
> diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
> index 4b14d8b50e9e..9439780f3f07 100644
> --- a/Documentation/admin-guide/mm/index.rst
> +++ b/Documentation/admin-guide/mm/index.rst
> @@ -38,3 +38,4 @@ the Linux memory management.
> soft-dirty
> transhuge
> userfaultfd
> + direct_mapping_splits
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 16f878c26667..767cade53bdc 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -16,6 +16,8 @@
> #include <linux/pci.h>
> #include <linux/vmalloc.h>
> #include <linux/libnvdimm.h>
> +#include <linux/vmstat.h>
> +#include <linux/kernel.h>
>
> #include <asm/e820/api.h>
> #include <asm/processor.h>
> @@ -85,12 +87,23 @@ void update_page_count(int level, unsigned long pages)
> spin_unlock(&pgd_lock);
> }
>
> +void update_split_page_event_count(int level)
> +{
> + if (system_state == SYSTEM_RUNNING) {
> + if (level == PG_LEVEL_2M)
> + count_vm_event(DIRECT_MAP_LEVEL2_SPLIT);
> + else if (level == PG_LEVEL_1G)
> + count_vm_event(DIRECT_MAP_LEVEL3_SPLIT);
> + }
> +}
> +
> static void split_page_count(int level)
> {
> if (direct_pages_count[level] == 0)
> return;
>
> direct_pages_count[level]--;
> + update_split_page_event_count(level);
> direct_pages_count[level - 1] += PTRS_PER_PTE;
> }
>
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 18e75974d4e3..7c06c2bdc33b 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> #ifdef CONFIG_SWAP
> SWAP_RA,
> SWAP_RA_HIT,
> +#endif
> +#ifdef CONFIG_X86
> + DIRECT_MAP_LEVEL2_SPLIT,
> + DIRECT_MAP_LEVEL3_SPLIT,
> #endif
> NR_VM_EVENT_ITEMS
> };
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index f8942160fc95..a43ac4ac98a2 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1350,6 +1350,10 @@ const char * const vmstat_text[] = {
> "swap_ra",
> "swap_ra_hit",
> #endif
> +#ifdef CONFIG_X86
> + "direct_map_level2_splits",
> + "direct_map_level3_splits",
> +#endif
> #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
> };
> #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
> --
> 2.24.1
>