2020-04-03 11:32:49

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2 0/8] thp/khugepaged improvements and CoW semantics

The patchset adds khugepaged selftest (anon-THP only for now), expands
cases khugepaged can handle and switches anon-THP copy-on-write handling
to 4k.

Please review and consider applying.

v2:
- Fix race in compound page handling;
- Add one more test-case for compound page case;
- Rework LRU add cache draining;
- Typos;

Kirill A. Shutemov (8):
khugepaged: Add self test
khugepaged: Do not stop collapse if less than half PTEs are referenced
khugepaged: Drain all LRU caches before scanning pages
khugepaged: Drain LRU add pagevec after swapin
khugepaged: Allow to callapse a page shared across fork
khugepaged: Allow to collapse PTE-mapped compound pages
thp: Change CoW semantics for anon-THP
khugepaged: Introduce 'max_ptes_shared' tunable

Documentation/admin-guide/mm/transhuge.rst | 7 +
include/trace/events/huge_memory.h | 3 +-
mm/huge_memory.c | 249 +-----
mm/khugepaged.c | 184 +++-
tools/testing/selftests/vm/Makefile | 1 +
tools/testing/selftests/vm/khugepaged.c | 982 +++++++++++++++++++++
6 files changed, 1153 insertions(+), 273 deletions(-)
create mode 100644 tools/testing/selftests/vm/khugepaged.c

--
2.26.0


2020-04-03 12:29:30

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2 1/8] khugepaged: Add self test

The test checks if khugepaged is able to recover huge page where we
expetect to do so. It only covers anon-THP for now.

Currenlty the test shows few failures. They are going to be addressed by
the following patches.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
tools/testing/selftests/vm/Makefile | 1 +
tools/testing/selftests/vm/khugepaged.c | 899 ++++++++++++++++++++++++
2 files changed, 900 insertions(+)
create mode 100644 tools/testing/selftests/vm/khugepaged.c

diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 7f9a8a8c31da..981d0dc21f9e 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -18,6 +18,7 @@ TEST_GEN_FILES += on-fault-limit
TEST_GEN_FILES += thuge-gen
TEST_GEN_FILES += transhuge-stress
TEST_GEN_FILES += userfaultfd
+TEST_GEN_FILES += khugepaged

ifneq (,$(filter $(ARCH),arm64 ia64 mips64 parisc64 ppc64 riscv64 s390x sh64 sparc64 x86_64))
TEST_GEN_FILES += va_128TBswitch
diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
new file mode 100644
index 000000000000..3eeff4a0fbc9
--- /dev/null
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -0,0 +1,899 @@
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <limits.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <sys/mman.h>
+#include <sys/wait.h>
+
+#ifndef MADV_PAGEOUT
+#define MADV_PAGEOUT 21
+#endif
+
+#define BASE_ADDR ((void *)(1UL << 30))
+static unsigned long hpage_pmd_size;
+static unsigned long page_size;
+static int hpage_pmd_nr;
+
+#define THP_SYSFS "/sys/kernel/mm/transparent_hugepage/"
+
+enum thp_enabled {
+ THP_ALWAYS,
+ THP_MADVISE,
+ THP_NEVER,
+};
+
+static const char *thp_enabled_strings[] = {
+ "always",
+ "madvise",
+ "never",
+ NULL
+};
+
+enum thp_defrag {
+ THP_DEFRAG_ALWAYS,
+ THP_DEFRAG_DEFER,
+ THP_DEFRAG_DEFER_MADVISE,
+ THP_DEFRAG_MADVISE,
+ THP_DEFRAG_NEVER,
+};
+
+static const char *thp_defrag_strings[] = {
+ "always",
+ "defer",
+ "defer+madvise",
+ "madvise",
+ "never",
+ NULL
+};
+
+enum shmem_enabled {
+ SHMEM_ALWAYS,
+ SHMEM_WITHIN_SIZE,
+ SHMEM_ADVISE,
+ SHMEM_NEVER,
+ SHMEM_DENY,
+ SHMEM_FORCE,
+};
+
+static const char *shmem_enabled_strings[] = {
+ "always",
+ "within_size",
+ "advise",
+ "never",
+ "deny",
+ "force",
+ NULL
+};
+
+struct khugepaged_settings {
+ bool defrag;
+ unsigned int alloc_sleep_millisecs;
+ unsigned int scan_sleep_millisecs;
+ unsigned int max_ptes_none;
+ unsigned int max_ptes_swap;
+ unsigned long pages_to_scan;
+};
+
+struct settings {
+ enum thp_enabled thp_enabled;
+ enum thp_defrag thp_defrag;
+ enum shmem_enabled shmem_enabled;
+ bool debug_cow;
+ bool use_zero_page;
+ struct khugepaged_settings khugepaged;
+};
+
+static struct settings default_settings = {
+ .thp_enabled = THP_MADVISE,
+ .thp_defrag = THP_DEFRAG_ALWAYS,
+ .shmem_enabled = SHMEM_NEVER,
+ .debug_cow = 0,
+ .use_zero_page = 0,
+ .khugepaged = {
+ .defrag = 1,
+ .alloc_sleep_millisecs = 10,
+ .scan_sleep_millisecs = 10,
+ },
+};
+
+static struct settings saved_settings;
+static bool skip_settings_restore;
+
+static int exit_status;
+
+static void success(const char *msg)
+{
+ printf(" \e[32m%s\e[0m\n", msg);
+}
+
+static void fail(const char *msg)
+{
+ printf(" \e[31m%s\e[0m\n", msg);
+ exit_status++;
+}
+
+static int read_file(const char *path, char *buf, size_t buflen)
+{
+ int fd;
+ ssize_t numread;
+
+ fd = open(path, O_RDONLY);
+ if (fd == -1)
+ return 0;
+
+ numread = read(fd, buf, buflen - 1);
+ if (numread < 1) {
+ close(fd);
+ return 0;
+ }
+
+ buf[numread] = '\0';
+ close(fd);
+
+ return (unsigned int) numread;
+}
+
+static int write_file(const char *path, const char *buf, size_t buflen)
+{
+ int fd;
+ ssize_t numwritten;
+
+ fd = open(path, O_WRONLY);
+ if (fd == -1)
+ return 0;
+
+ numwritten = write(fd, buf, buflen - 1);
+ close(fd);
+ if (numwritten < 1)
+ return 0;
+
+ return (unsigned int) numwritten;
+}
+
+static int read_string(const char *name, const char *strings[])
+{
+ char path[PATH_MAX];
+ char buf[256];
+ char *c;
+ int ret;
+
+ ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name);
+ if (ret >= PATH_MAX) {
+ printf("%s: Pathname is too long\n", __func__);
+ exit(EXIT_FAILURE);
+ }
+
+ if (!read_file(path, buf, sizeof(buf))) {
+ perror(path);
+ exit(EXIT_FAILURE);
+ }
+
+ c = strchr(buf, '[');
+ if (!c) {
+ printf("%s: Parse failure\n", __func__);
+ exit(EXIT_FAILURE);
+ }
+
+ c++;
+ memmove(buf, c, sizeof(buf) - (c - buf));
+
+ c = strchr(buf, ']');
+ if (!c) {
+ printf("%s: Parse failure\n", __func__);
+ exit(EXIT_FAILURE);
+ }
+ *c = '\0';
+
+ ret = 0;
+ while (strings[ret]) {
+ if (!strcmp(strings[ret], buf))
+ return ret;
+ ret++;
+ }
+
+ printf("Failed to parse %s\n", name);
+ exit(EXIT_FAILURE);
+}
+
+static void write_string(const char *name, const char *val)
+{
+ char path[PATH_MAX];
+ int ret;
+
+ ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name);
+ if (ret >= PATH_MAX) {
+ printf("%s: Pathname is too long\n", __func__);
+ exit(EXIT_FAILURE);
+ }
+
+ if (!write_file(path, val, strlen(val) + 1)) {
+ perror(path);
+ exit(EXIT_FAILURE);
+ }
+}
+
+static const unsigned long read_num(const char *name)
+{
+ char path[PATH_MAX];
+ char buf[21];
+ int ret;
+
+ ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name);
+ if (ret >= PATH_MAX) {
+ printf("%s: Pathname is too long\n", __func__);
+ exit(EXIT_FAILURE);
+ }
+
+ ret = read_file(path, buf, sizeof(buf));
+ if (ret < 0) {
+ perror("read_file(read_num)");
+ exit(EXIT_FAILURE);
+ }
+
+ return strtoul(buf, NULL, 10);
+}
+
+static void write_num(const char *name, unsigned long num)
+{
+ char path[PATH_MAX];
+ char buf[21];
+ int ret;
+
+ ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name);
+ if (ret >= PATH_MAX) {
+ printf("%s: Pathname is too long\n", __func__);
+ exit(EXIT_FAILURE);
+ }
+
+ sprintf(buf, "%ld", num);
+ if (!write_file(path, buf, strlen(buf) + 1)) {
+ perror(path);
+ exit(EXIT_FAILURE);
+ }
+}
+
+static void write_settings(struct settings *settings)
+{
+ struct khugepaged_settings *khugepaged = &settings->khugepaged;
+
+ write_string("enabled", thp_enabled_strings[settings->thp_enabled]);
+ write_string("defrag", thp_defrag_strings[settings->thp_defrag]);
+ write_string("shmem_enabled",
+ shmem_enabled_strings[settings->shmem_enabled]);
+ write_num("debug_cow", settings->debug_cow);
+ write_num("use_zero_page", settings->use_zero_page);
+
+ write_num("khugepaged/defrag", khugepaged->defrag);
+ write_num("khugepaged/alloc_sleep_millisecs",
+ khugepaged->alloc_sleep_millisecs);
+ write_num("khugepaged/scan_sleep_millisecs",
+ khugepaged->scan_sleep_millisecs);
+ write_num("khugepaged/max_ptes_none", khugepaged->max_ptes_none);
+ write_num("khugepaged/max_ptes_swap", khugepaged->max_ptes_swap);
+ write_num("khugepaged/pages_to_scan", khugepaged->pages_to_scan);
+}
+
+static void restore_settings(int sig)
+{
+ if (skip_settings_restore)
+ goto out;
+
+ printf("Restore THP and khugepaged settings...");
+ write_settings(&saved_settings);
+ success("OK");
+ if (sig)
+ exit(EXIT_FAILURE);
+out:
+ exit(exit_status);
+}
+
+static void save_settings(void)
+{
+ printf("Save THP and khugepaged settings...");
+ saved_settings = (struct settings) {
+ .thp_enabled = read_string("enabled", thp_enabled_strings),
+ .thp_defrag = read_string("defrag", thp_defrag_strings),
+ .shmem_enabled =
+ read_string("shmem_enabled", shmem_enabled_strings),
+ .debug_cow = read_num("debug_cow"),
+ .use_zero_page = read_num("use_zero_page"),
+ };
+ saved_settings.khugepaged = (struct khugepaged_settings) {
+ .defrag = read_num("khugepaged/defrag"),
+ .alloc_sleep_millisecs =
+ read_num("khugepaged/alloc_sleep_millisecs"),
+ .scan_sleep_millisecs =
+ read_num("khugepaged/scan_sleep_millisecs"),
+ .max_ptes_none = read_num("khugepaged/max_ptes_none"),
+ .max_ptes_swap = read_num("khugepaged/max_ptes_swap"),
+ .pages_to_scan = read_num("khugepaged/pages_to_scan"),
+ };
+ success("OK");
+
+ signal(SIGTERM, restore_settings);
+ signal(SIGINT, restore_settings);
+ signal(SIGHUP, restore_settings);
+ signal(SIGQUIT, restore_settings);
+}
+
+static void adjust_settings(void)
+{
+
+ printf("Adjust settings...");
+ write_settings(&default_settings);
+ success("OK");
+}
+
+#define CHECK_HUGE_FMT "sed -ne " \
+ "'/^%lx/,/^AnonHugePages/{/^AnonHugePages:\\s*%ld kB/ q1}' " \
+ "/proc/%d/smaps"
+
+static bool check_huge(void *p)
+{
+ char *cmd;
+ int ret;
+
+ ret = asprintf(&cmd, CHECK_HUGE_FMT,
+ (unsigned long)p, hpage_pmd_size >> 10, getpid());
+ if (ret < 0) {
+ perror("asprintf(CHECK_FMT)");
+ exit(EXIT_FAILURE);
+ }
+
+ ret = system(cmd);
+ free(cmd);
+ if (ret < 0 || !WIFEXITED(ret)) {
+ perror("system(check_huge)");
+ exit(EXIT_FAILURE);
+ }
+
+ return WEXITSTATUS(ret);
+}
+
+#define CHECK_SWAP_FMT "sed -ne " \
+ "'/^%lx/,/^Swap:/{/^Swap:\\s*%ld kB/ q1}' " \
+ "/proc/%d/smaps"
+
+static bool check_swap(void *p, unsigned long size)
+{
+ char *cmd;
+ int ret;
+
+ ret = asprintf(&cmd, CHECK_SWAP_FMT,
+ (unsigned long)p, size >> 10, getpid());
+ if (ret < 0) {
+ perror("asprintf(CHECK_SWAP)");
+ exit(EXIT_FAILURE);
+ }
+
+ ret = system(cmd);
+ free(cmd);
+ if (ret < 0 || !WIFEXITED(ret)) {
+ perror("system(check_swap)");
+ exit(EXIT_FAILURE);
+ }
+
+ return WEXITSTATUS(ret);
+}
+
+static void *alloc_mapping(void)
+{
+ void *p;
+
+ p = mmap(BASE_ADDR, hpage_pmd_size, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ if (p != BASE_ADDR) {
+ printf("Failed to allocate VMA at %p\n", BASE_ADDR);
+ exit(EXIT_FAILURE);
+ }
+
+ return p;
+}
+
+static void fill_memory(int *p, unsigned long start, unsigned long end)
+{
+ int i;
+
+ for (i = start / page_size; i < end / page_size; i++)
+ p[i * page_size / sizeof(*p)] = i + 0xdead0000;
+}
+
+static void validate_memory(int *p, unsigned long start, unsigned long end)
+{
+ int i;
+
+ for (i = start / page_size; i < end / page_size; i++) {
+ if (p[i * page_size / sizeof(*p)] != i + 0xdead0000) {
+ printf("Page %d is corrupted: %#x\n",
+ i, p[i * page_size / sizeof(*p)]);
+ exit(EXIT_FAILURE);
+ }
+ }
+}
+
+#define TICK 500000
+static bool wait_for_scan(const char *msg, char *p)
+{
+ int full_scans;
+ int timeout = 6; /* 3 seconds */
+
+ /* Sanity check */
+ if (check_huge(p)) {
+ printf("Unexpected huge page\n");
+ exit(EXIT_FAILURE);
+ }
+
+ madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
+
+ /* Wait until the second full_scan completed */
+ full_scans = read_num("khugepaged/full_scans") + 2;
+
+ printf("%s...", msg);
+ while (timeout--) {
+ if (check_huge(p))
+ break;
+ if (read_num("khugepaged/full_scans") >= full_scans)
+ break;
+ printf(".");
+ usleep(TICK);
+ }
+
+ madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
+
+ return !timeout;
+}
+
+static void alloc_at_fault(void)
+{
+ struct settings settings = default_settings;
+ char *p;
+
+ settings.thp_enabled = THP_ALWAYS;
+ write_settings(&settings);
+
+ p = alloc_mapping();
+ *p = 1;
+ printf("Allocate huge page on fault...");
+ if (check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+
+ write_settings(&default_settings);
+
+ madvise(p, page_size, MADV_DONTNEED);
+ printf("Split huge PMD on MADV_DONTNEED...");
+ if (!check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+ munmap(p, hpage_pmd_size);
+}
+
+static void collapse_full(void)
+{
+ void *p;
+
+ p = alloc_mapping();
+ fill_memory(p, 0, hpage_pmd_size);
+ if (wait_for_scan("Collapse fully populated PTE table", p))
+ fail("Timeout");
+ else if (check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+ validate_memory(p, 0, hpage_pmd_size);
+ munmap(p, hpage_pmd_size);
+}
+
+static void collapse_empty(void)
+{
+ void *p;
+
+ p = alloc_mapping();
+ if (wait_for_scan("Do not collapse empty PTE table", p))
+ fail("Timeout");
+ else if (check_huge(p))
+ fail("Fail");
+ else
+ success("OK");
+ munmap(p, hpage_pmd_size);
+}
+
+static void collapse_signle_pte_entry(void)
+{
+ void *p;
+
+ p = alloc_mapping();
+ fill_memory(p, 0, page_size);
+ if (wait_for_scan("Collapse PTE table with single PTE entry present", p))
+ fail("Timeout");
+ else if (check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+ validate_memory(p, 0, page_size);
+ munmap(p, hpage_pmd_size);
+}
+
+static void collapse_max_ptes_none(void)
+{
+ int max_ptes_none = hpage_pmd_nr / 2;
+ struct settings settings = default_settings;
+ void *p;
+
+ settings.khugepaged.max_ptes_none = max_ptes_none;
+ write_settings(&settings);
+
+ p = alloc_mapping();
+
+ fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size);
+ if (wait_for_scan("Do not collapse with max_ptes_none exeeded", p))
+ fail("Timeout");
+ else if (check_huge(p))
+ fail("Fail");
+ else
+ success("OK");
+ validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size);
+
+ fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
+ if (wait_for_scan("Collapse with max_ptes_none PTEs empty", p))
+ fail("Timeout");
+ else if (check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+ validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
+
+ munmap(p, hpage_pmd_size);
+ write_settings(&default_settings);
+}
+
+static void collapse_swapin_single_pte(void)
+{
+ void *p;
+ p = alloc_mapping();
+ fill_memory(p, 0, hpage_pmd_size);
+
+ printf("Swapout one page...");
+ if (madvise(p, page_size, MADV_PAGEOUT)) {
+ perror("madvise(MADV_PAGEOUT)");
+ exit(EXIT_FAILURE);
+ }
+ if (check_swap(p, page_size)) {
+ success("OK");
+ } else {
+ fail("Fail");
+ goto out;
+ }
+
+ if (wait_for_scan("Collapse with swaping in single PTE entry", p))
+ fail("Timeout");
+ else if (check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+ validate_memory(p, 0, hpage_pmd_size);
+out:
+ munmap(p, hpage_pmd_size);
+}
+
+static void collapse_max_ptes_swap(void)
+{
+ int max_ptes_swap = read_num("khugepaged/max_ptes_swap");
+ void *p;
+
+ p = alloc_mapping();
+
+ fill_memory(p, 0, hpage_pmd_size);
+ printf("Swapout %d of %d pages...", max_ptes_swap + 1, hpage_pmd_nr);
+ if (madvise(p, (max_ptes_swap + 1) * page_size, MADV_PAGEOUT)) {
+ perror("madvise(MADV_PAGEOUT)");
+ exit(EXIT_FAILURE);
+ }
+ if (check_swap(p, (max_ptes_swap + 1) * page_size)) {
+ success("OK");
+ } else {
+ fail("Fail");
+ goto out;
+ }
+
+ if (wait_for_scan("Do not collapse with max_ptes_swap exeeded", p))
+ fail("Timeout");
+ else if (check_huge(p))
+ fail("Fail");
+ else
+ success("OK");
+ validate_memory(p, 0, hpage_pmd_size);
+
+ fill_memory(p, 0, hpage_pmd_size);
+ printf("Swapout %d of %d pages...", max_ptes_swap, hpage_pmd_nr);
+ if (madvise(p, max_ptes_swap * page_size, MADV_PAGEOUT)) {
+ perror("madvise(MADV_PAGEOUT)");
+ exit(EXIT_FAILURE);
+ }
+ if (check_swap(p, max_ptes_swap * page_size)) {
+ success("OK");
+ } else {
+ fail("Fail");
+ goto out;
+ }
+
+ if (wait_for_scan("Collapse with max_ptes_swap pages swapped out", p))
+ fail("Timeout");
+ else if (check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+ validate_memory(p, 0, hpage_pmd_size);
+out:
+ munmap(p, hpage_pmd_size);
+}
+
+static void collapse_signle_pte_entry_compound(void)
+{
+ void *p;
+
+ p = alloc_mapping();
+
+ printf("Allocate huge page...");
+ madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
+ fill_memory(p, 0, hpage_pmd_size);
+ if (check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+ madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
+
+ printf("Split huge page leaving single PTE mapping compount page...");
+ madvise(p + page_size, hpage_pmd_size - page_size, MADV_DONTNEED);
+ if (!check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+
+ if (wait_for_scan("Collapse PTE table with single PTE mapping compount page", p))
+ fail("Timeout");
+ else if (check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+ validate_memory(p, 0, page_size);
+ munmap(p, hpage_pmd_size);
+}
+
+static void collapse_full_of_compound(void)
+{
+ void *p;
+
+ p = alloc_mapping();
+
+ printf("Allocate huge page...");
+ madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
+ fill_memory(p, 0, hpage_pmd_size);
+ if (check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+
+ printf("Split huge page leaving single PTE page table full of compount pages...");
+ madvise(p, page_size, MADV_NOHUGEPAGE);
+ madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
+ if (!check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+
+ if (wait_for_scan("Collapse PTE table full of compound pages", p))
+ fail("Timeout");
+ else if (check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+ validate_memory(p, 0, hpage_pmd_size);
+ munmap(p, hpage_pmd_size);
+}
+
+static void collapse_compound_extreme(void)
+{
+ void *p;
+ int i;
+
+ p = alloc_mapping();
+ for (i = 0; i < hpage_pmd_nr; i++) {
+ printf("\rConstruct PTE page table full of different PTE-mapped compound pages %3d/%d...",
+ i + 1, hpage_pmd_nr);
+
+ madvise(BASE_ADDR, hpage_pmd_size, MADV_HUGEPAGE);
+ fill_memory(BASE_ADDR, 0, hpage_pmd_size);
+ if (!check_huge(BASE_ADDR)) {
+ printf("Failed to allocate huge page\n");
+ exit(EXIT_FAILURE);
+ }
+ madvise(BASE_ADDR, hpage_pmd_size, MADV_NOHUGEPAGE);
+
+ p = mremap(BASE_ADDR - i * page_size,
+ i * page_size + hpage_pmd_size,
+ (i + 1) * page_size,
+ MREMAP_MAYMOVE | MREMAP_FIXED,
+ BASE_ADDR + 2 * hpage_pmd_size);
+ if (p == MAP_FAILED) {
+ perror("mremap+unmap");
+ exit(EXIT_FAILURE);
+ }
+
+ p = mremap(BASE_ADDR + 2 * hpage_pmd_size,
+ (i + 1) * page_size,
+ (i + 1) * page_size + hpage_pmd_size,
+ MREMAP_MAYMOVE | MREMAP_FIXED,
+ BASE_ADDR - (i + 1) * page_size);
+ if (p == MAP_FAILED) {
+ perror("mremap+alloc");
+ exit(EXIT_FAILURE);
+ }
+ }
+
+ munmap(BASE_ADDR, hpage_pmd_size);
+ fill_memory(p, 0, hpage_pmd_size);
+ if (!check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+
+ if (wait_for_scan("Collapse PTE table full of different compound pages", p))
+ fail("Timeout");
+ else if (check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+
+ validate_memory(p, 0, hpage_pmd_size);
+ munmap(p, hpage_pmd_size);
+}
+
+static void collapse_fork(void)
+{
+ int wstatus;
+ void *p;
+
+ p = alloc_mapping();
+
+ printf("Allocate small page...");
+ fill_memory(p, 0, page_size);
+ if (!check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+
+ printf("Share small page over fork()...");
+ if (!fork()) {
+ /* Do not touch settings on child exit */
+ skip_settings_restore = true;
+ exit_status = 0;
+
+ if (!check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+
+ fill_memory(p, page_size, 2 * page_size);
+
+ if (wait_for_scan("Collapse PTE table with single page shared with parent process", p))
+ fail("Timeout");
+ else if (check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+
+ validate_memory(p, 0, page_size);
+ munmap(p, hpage_pmd_size);
+ exit(exit_status);
+ }
+
+ wait(&wstatus);
+ exit_status += WEXITSTATUS(wstatus);
+
+ printf("Check if parent still has small page...");
+ if (!check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+ validate_memory(p, 0, page_size);
+ munmap(p, hpage_pmd_size);
+}
+
+static void collapse_fork_compound(void)
+{
+ int wstatus;
+ void *p;
+
+ p = alloc_mapping();
+
+ printf("Allocate huge page...");
+ madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
+ fill_memory(p, 0, hpage_pmd_size);
+ if (check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+
+ printf("Share huge page over fork()...");
+ if (!fork()) {
+ /* Do not touch settings on child exit */
+ skip_settings_restore = true;
+ exit_status = 0;
+
+ if (check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+
+ printf("Split huge page PMD in child process...");
+ madvise(p, page_size, MADV_NOHUGEPAGE);
+ madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
+ if (!check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+ fill_memory(p, 0, page_size);
+
+ if (wait_for_scan("Collapse PTE table full of compound pages in child", p))
+ fail("Timeout");
+ else if (check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+
+ validate_memory(p, 0, hpage_pmd_size);
+ munmap(p, hpage_pmd_size);
+ exit(exit_status);
+ }
+
+ wait(&wstatus);
+ exit_status += WEXITSTATUS(wstatus);
+
+ printf("Check if parent still has huge page...");
+ if (check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+ validate_memory(p, 0, hpage_pmd_size);
+ munmap(p, hpage_pmd_size);
+}
+
+int main(void)
+{
+ setbuf(stdout, NULL);
+
+ page_size = getpagesize();
+ hpage_pmd_size = read_num("hpage_pmd_size");
+ hpage_pmd_nr = hpage_pmd_size / page_size;
+
+ default_settings.khugepaged.max_ptes_none = hpage_pmd_nr - 1;
+ default_settings.khugepaged.max_ptes_swap = hpage_pmd_nr / 8;
+ default_settings.khugepaged.pages_to_scan = hpage_pmd_nr * 8;
+
+ save_settings();
+ adjust_settings();
+
+ alloc_at_fault();
+ collapse_full();
+ collapse_empty();
+ collapse_signle_pte_entry();
+ collapse_max_ptes_none();
+ collapse_swapin_single_pte();
+ collapse_max_ptes_swap();
+ collapse_signle_pte_entry_compound();
+ collapse_full_of_compound();
+ collapse_compound_extreme();
+ collapse_fork();
+ collapse_fork_compound();
+
+ restore_settings(0);
+}
--
2.26.0

2020-04-03 12:29:54

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2 6/8] khugepaged: Allow to collapse PTE-mapped compound pages

We can collapse PTE-mapped compound pages. We only need to avoid
handling them more than once: lock/unlock page only once if it's present
in the PMD range multiple times as it handled on compound level. The
same goes for LRU isolation and putback.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/khugepaged.c | 103 ++++++++++++++++++++++++++++++++----------------
1 file changed, 68 insertions(+), 35 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1e7e6543ebca..49e56e4e30d1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -515,23 +515,37 @@ void __khugepaged_exit(struct mm_struct *mm)

static void release_pte_page(struct page *page)
{
- dec_node_page_state(page, NR_ISOLATED_ANON + page_is_file_cache(page));
+ mod_node_page_state(page_pgdat(page),
+ NR_ISOLATED_ANON + page_is_file_cache(page),
+ -compound_nr(page));
unlock_page(page);
putback_lru_page(page);
}

-static void release_pte_pages(pte_t *pte, pte_t *_pte)
+static void release_pte_pages(pte_t *pte, pte_t *_pte,
+ struct list_head *compound_pagelist)
{
+ struct page *page, *tmp;
+
while (--_pte >= pte) {
pte_t pteval = *_pte;
- if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval)))
- release_pte_page(pte_page(pteval));
+
+ page = pte_page(pteval);
+ if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval)) &&
+ !PageCompound(page))
+ release_pte_page(page);
+ }
+
+ list_for_each_entry_safe(page, tmp, compound_pagelist, lru) {
+ list_del(&page->lru);
+ release_pte_page(page);
}
}

static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
unsigned long address,
- pte_t *pte)
+ pte_t *pte,
+ struct list_head *compound_pagelist)
{
struct page *page = NULL;
pte_t *_pte;
@@ -561,13 +575,21 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
goto out;
}

- /* TODO: teach khugepaged to collapse THP mapped with pte */
+ VM_BUG_ON_PAGE(!PageAnon(page), page);
+
if (PageCompound(page)) {
- result = SCAN_PAGE_COMPOUND;
- goto out;
- }
+ struct page *p;
+ page = compound_head(page);

- VM_BUG_ON_PAGE(!PageAnon(page), page);
+ /*
+ * Check if we have dealt with the compound page
+ * already
+ */
+ list_for_each_entry(p, compound_pagelist, lru) {
+ if (page == p)
+ goto next;
+ }
+ }

/*
* We can do it before isolate_lru_page because the
@@ -597,19 +619,15 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
result = SCAN_PAGE_COUNT;
goto out;
}
- if (pte_write(pteval)) {
- writable = true;
- } else {
- if (PageSwapCache(page) &&
- !reuse_swap_page(page, NULL)) {
- unlock_page(page);
- result = SCAN_SWAP_CACHE_PAGE;
- goto out;
- }
+ if (!pte_write(pteval) && PageSwapCache(page) &&
+ !reuse_swap_page(page, NULL)) {
/*
- * Page is not in the swap cache. It can be collapsed
- * into a THP.
+ * Page is in the swap cache and cannot be re-used.
+ * It cannot be collapsed into a THP.
*/
+ unlock_page(page);
+ result = SCAN_SWAP_CACHE_PAGE;
+ goto out;
}

/*
@@ -621,16 +639,23 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
result = SCAN_DEL_PAGE_LRU;
goto out;
}
- inc_node_page_state(page,
- NR_ISOLATED_ANON + page_is_file_cache(page));
+ mod_node_page_state(page_pgdat(page),
+ NR_ISOLATED_ANON + page_is_file_cache(page),
+ compound_nr(page));
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageLRU(page), page);

+ if (PageCompound(page))
+ list_add_tail(&page->lru, compound_pagelist);
+next:
/* There should be enough young pte to collapse the page */
if (pte_young(pteval) ||
page_is_young(page) || PageReferenced(page) ||
mmu_notifier_test_young(vma->vm_mm, address))
referenced++;
+
+ if (pte_write(pteval))
+ writable = true;
}
if (likely(writable)) {
if (likely(referenced)) {
@@ -644,7 +669,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
}

out:
- release_pte_pages(pte, _pte);
+ release_pte_pages(pte, _pte, compound_pagelist);
trace_mm_collapse_huge_page_isolate(page, none_or_zero,
referenced, writable, result);
return 0;
@@ -653,13 +678,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
struct vm_area_struct *vma,
unsigned long address,
- spinlock_t *ptl)
+ spinlock_t *ptl,
+ struct list_head *compound_pagelist)
{
+ struct page *src_page, *tmp;
pte_t *_pte;
for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
_pte++, page++, address += PAGE_SIZE) {
pte_t pteval = *_pte;
- struct page *src_page;

if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
clear_user_highpage(page, address);
@@ -679,7 +705,6 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
} else {
src_page = pte_page(pteval);
copy_user_highpage(page, src_page, address, vma);
- release_pte_page(src_page);
/*
* ptl mostly unnecessary, but preempt has to
* be disabled to update the per-cpu stats
@@ -693,9 +718,18 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
pte_clear(vma->vm_mm, address, _pte);
page_remove_rmap(src_page, false);
spin_unlock(ptl);
- free_page_and_swap_cache(src_page);
+ if (!PageCompound(src_page)) {
+ release_pte_page(src_page);
+ free_page_and_swap_cache(src_page);
+ }
}
}
+
+ list_for_each_entry_safe(src_page, tmp, compound_pagelist, lru) {
+ list_del(&src_page->lru);
+ release_pte_page(src_page);
+ free_page_and_swap_cache(src_page);
+ }
}

static void khugepaged_alloc_sleep(void)
@@ -960,6 +994,7 @@ static void collapse_huge_page(struct mm_struct *mm,
struct page **hpage,
int node, int referenced)
{
+ LIST_HEAD(compound_pagelist);
pmd_t *pmd, _pmd;
pte_t *pte;
pgtable_t pgtable;
@@ -1059,7 +1094,8 @@ static void collapse_huge_page(struct mm_struct *mm,
mmu_notifier_invalidate_range_end(&range);

spin_lock(pte_ptl);
- isolated = __collapse_huge_page_isolate(vma, address, pte);
+ isolated = __collapse_huge_page_isolate(vma, address, pte,
+ &compound_pagelist);
spin_unlock(pte_ptl);

if (unlikely(!isolated)) {
@@ -1084,7 +1120,8 @@ static void collapse_huge_page(struct mm_struct *mm,
*/
anon_vma_unlock_write(vma->anon_vma);

- __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl);
+ __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl,
+ &compound_pagelist);
pte_unmap(pte);
__SetPageUptodate(new_page);
pgtable = pmd_pgtable(_pmd);
@@ -1181,11 +1218,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
goto out_unmap;
}

- /* TODO: teach khugepaged to collapse THP mapped with pte */
- if (PageCompound(page)) {
- result = SCAN_PAGE_COMPOUND;
- goto out_unmap;
- }
+ page = compound_head(page);

/*
* Record which node the original page is from and save this
--
2.26.0

2020-04-03 12:30:20

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2 2/8] khugepaged: Do not stop collapse if less than half PTEs are referenced

__collapse_huge_page_swapin() check number of referenced PTE to decide
if the memory range is hot enough to justify swapin.

The problem is that it stops collapse altogether if there's not enough
referenced pages, not only swappingin.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Fixes: 0db501f7a34c ("mm, thp: convert from optimistic swapin collapsing to conservative")
Reviewed-by: Zi Yan <[email protected]>
---
mm/khugepaged.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 99bab7e4d05b..14d7afc90786 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -905,7 +905,8 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
/* we only decide to swapin, if there is enough young ptes */
if (referenced < HPAGE_PMD_NR/2) {
trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
- return false;
+ /* Do not block collapse, only skip swapping in */
+ return true;
}
vmf.pte = pte_offset_map(pmd, address);
for (; vmf.address < address + HPAGE_PMD_NR*PAGE_SIZE;
--
2.26.0

2020-04-03 12:30:21

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2 8/8] khugepaged: Introduce 'max_ptes_shared' tunable

``max_ptes_shared`` speicies how many pages can be shared across multiple
processes. Exeeding the number woul block the collapse::

/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared

A higher value may increase memory footprint for some workloads.

By default, at least half of pages has to be not shared.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/admin-guide/mm/transhuge.rst | 7 ++
include/trace/events/huge_memory.h | 3 +-
mm/khugepaged.c | 52 ++++++++++++--
tools/testing/selftests/vm/khugepaged.c | 83 ++++++++++++++++++++++
4 files changed, 140 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index bd5714547cee..d16e4f2bb70f 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -220,6 +220,13 @@ memory. A lower value can prevent THPs from being
collapsed, resulting fewer pages being collapsed into
THPs, and lower memory access performance.

+``max_ptes_shared`` speicies how many pages can be shared across multiple
+processes. Exeeding the number woul block the collapse::
+
+ /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared
+
+A higher value may increase memory footprint for some workloads.
+
Boot parameter
==============

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index d82a0f4e824d..53532f5925c3 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -12,6 +12,8 @@
EM( SCAN_SUCCEED, "succeeded") \
EM( SCAN_PMD_NULL, "pmd_null") \
EM( SCAN_EXCEED_NONE_PTE, "exceed_none_pte") \
+ EM( SCAN_EXCEED_SWAP_PTE, "exceed_swap_pte") \
+ EM( SCAN_EXCEED_SHARED_PTE, "exceed_shared_pte") \
EM( SCAN_PTE_NON_PRESENT, "pte_non_present") \
EM( SCAN_PAGE_RO, "no_writable_page") \
EM( SCAN_LACK_REFERENCED_PAGE, "lack_referenced_page") \
@@ -30,7 +32,6 @@
EM( SCAN_DEL_PAGE_LRU, "could_not_delete_page_from_lru")\
EM( SCAN_ALLOC_HUGE_PAGE_FAIL, "alloc_huge_page_failed") \
EM( SCAN_CGROUP_CHARGE_FAIL, "ccgroup_charge_failed") \
- EM( SCAN_EXCEED_SWAP_PTE, "exceed_swap_pte") \
EM( SCAN_TRUNCATED, "truncated") \
EMe(SCAN_PAGE_HAS_PRIVATE, "page_has_private") \

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 49e56e4e30d1..bfb6155f1d69 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -28,6 +28,8 @@ enum scan_result {
SCAN_SUCCEED,
SCAN_PMD_NULL,
SCAN_EXCEED_NONE_PTE,
+ SCAN_EXCEED_SWAP_PTE,
+ SCAN_EXCEED_SHARED_PTE,
SCAN_PTE_NON_PRESENT,
SCAN_PAGE_RO,
SCAN_LACK_REFERENCED_PAGE,
@@ -46,7 +48,6 @@ enum scan_result {
SCAN_DEL_PAGE_LRU,
SCAN_ALLOC_HUGE_PAGE_FAIL,
SCAN_CGROUP_CHARGE_FAIL,
- SCAN_EXCEED_SWAP_PTE,
SCAN_TRUNCATED,
SCAN_PAGE_HAS_PRIVATE,
};
@@ -71,6 +72,7 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
*/
static unsigned int khugepaged_max_ptes_none __read_mostly;
static unsigned int khugepaged_max_ptes_swap __read_mostly;
+static unsigned int khugepaged_max_ptes_shared __read_mostly;

#define MM_SLOTS_HASH_BITS 10
static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
@@ -290,15 +292,43 @@ static struct kobj_attribute khugepaged_max_ptes_swap_attr =
__ATTR(max_ptes_swap, 0644, khugepaged_max_ptes_swap_show,
khugepaged_max_ptes_swap_store);

+static ssize_t khugepaged_max_ptes_shared_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%u\n", khugepaged_max_ptes_shared);
+}
+
+static ssize_t khugepaged_max_ptes_shared_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int err;
+ unsigned long max_ptes_shared;
+
+ err = kstrtoul(buf, 10, &max_ptes_shared);
+ if (err || max_ptes_shared > HPAGE_PMD_NR-1)
+ return -EINVAL;
+
+ khugepaged_max_ptes_shared = max_ptes_shared;
+
+ return count;
+}
+
+static struct kobj_attribute khugepaged_max_ptes_shared_attr =
+ __ATTR(max_ptes_shared, 0644, khugepaged_max_ptes_shared_show,
+ khugepaged_max_ptes_shared_store);
+
static struct attribute *khugepaged_attr[] = {
&khugepaged_defrag_attr.attr,
&khugepaged_max_ptes_none_attr.attr,
+ &khugepaged_max_ptes_swap_attr.attr,
+ &khugepaged_max_ptes_shared_attr.attr,
&pages_to_scan_attr.attr,
&pages_collapsed_attr.attr,
&full_scans_attr.attr,
&scan_sleep_millisecs_attr.attr,
&alloc_sleep_millisecs_attr.attr,
- &khugepaged_max_ptes_swap_attr.attr,
NULL,
};

@@ -360,6 +390,7 @@ int __init khugepaged_init(void)
khugepaged_pages_to_scan = HPAGE_PMD_NR * 8;
khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;
khugepaged_max_ptes_swap = HPAGE_PMD_NR / 8;
+ khugepaged_max_ptes_shared = HPAGE_PMD_NR / 2;

return 0;
}
@@ -549,7 +580,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
{
struct page *page = NULL;
pte_t *_pte;
- int none_or_zero = 0, result = 0, referenced = 0;
+ int none_or_zero = 0, shared = 0, result = 0, referenced = 0;
bool writable = false;

for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
@@ -577,6 +608,12 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,

VM_BUG_ON_PAGE(!PageAnon(page), page);

+ if (page_mapcount(page) > 1 &&
+ ++shared > khugepaged_max_ptes_shared) {
+ result = SCAN_EXCEED_SHARED_PTE;
+ goto out;
+ }
+
if (PageCompound(page)) {
struct page *p;
page = compound_head(page);
@@ -1168,7 +1205,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
{
pmd_t *pmd;
pte_t *pte, *_pte;
- int ret = 0, none_or_zero = 0, result = 0, referenced = 0;
+ int ret = 0, result = 0, referenced = 0;
+ int none_or_zero = 0, shared = 0;
struct page *page = NULL;
unsigned long _address;
spinlock_t *ptl;
@@ -1218,6 +1256,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
goto out_unmap;
}

+ if (page_mapcount(page) > 1 &&
+ ++shared > khugepaged_max_ptes_shared) {
+ result = SCAN_EXCEED_SHARED_PTE;
+ goto out_unmap;
+ }
+
page = compound_head(page);

/*
diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
index 3eeff4a0fbc9..9ae119234a39 100644
--- a/tools/testing/selftests/vm/khugepaged.c
+++ b/tools/testing/selftests/vm/khugepaged.c
@@ -77,6 +77,7 @@ struct khugepaged_settings {
unsigned int scan_sleep_millisecs;
unsigned int max_ptes_none;
unsigned int max_ptes_swap;
+ unsigned int max_ptes_shared;
unsigned long pages_to_scan;
};

@@ -276,6 +277,7 @@ static void write_settings(struct settings *settings)
khugepaged->scan_sleep_millisecs);
write_num("khugepaged/max_ptes_none", khugepaged->max_ptes_none);
write_num("khugepaged/max_ptes_swap", khugepaged->max_ptes_swap);
+ write_num("khugepaged/max_ptes_shared", khugepaged->max_ptes_shared);
write_num("khugepaged/pages_to_scan", khugepaged->pages_to_scan);
}

@@ -312,6 +314,7 @@ static void save_settings(void)
read_num("khugepaged/scan_sleep_millisecs"),
.max_ptes_none = read_num("khugepaged/max_ptes_none"),
.max_ptes_swap = read_num("khugepaged/max_ptes_swap"),
+ .max_ptes_shared = read_num("khugepaged/max_ptes_shared"),
.pages_to_scan = read_num("khugepaged/pages_to_scan"),
};
success("OK");
@@ -843,12 +846,90 @@ static void collapse_fork_compound(void)
fail("Fail");
fill_memory(p, 0, page_size);

+ write_num("khugepaged/max_ptes_shared", hpage_pmd_nr - 1);
if (wait_for_scan("Collapse PTE table full of compound pages in child", p))
fail("Timeout");
else if (check_huge(p))
success("OK");
else
fail("Fail");
+ write_num("khugepaged/max_ptes_shared",
+ default_settings.khugepaged.max_ptes_shared);
+
+ validate_memory(p, 0, hpage_pmd_size);
+ munmap(p, hpage_pmd_size);
+ exit(exit_status);
+ }
+
+ wait(&wstatus);
+ exit_status += WEXITSTATUS(wstatus);
+
+ printf("Check if parent still has huge page...");
+ if (check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+ validate_memory(p, 0, hpage_pmd_size);
+ munmap(p, hpage_pmd_size);
+}
+
+static void collapse_max_ptes_shared()
+{
+ int max_ptes_shared = read_num("khugepaged/max_ptes_shared");
+ int wstatus;
+ void *p;
+
+ p = alloc_mapping();
+
+ printf("Allocate huge page...");
+ madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
+ fill_memory(p, 0, hpage_pmd_size);
+ if (check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+
+ printf("Share huge page over fork()...");
+ if (!fork()) {
+ /* Do not touch settings on child exit */
+ skip_settings_restore = true;
+ exit_status = 0;
+
+ if (check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+
+ printf("Trigger CoW in %d of %d...",
+ hpage_pmd_nr - max_ptes_shared - 1, hpage_pmd_nr);
+ fill_memory(p, 0, (hpage_pmd_nr - max_ptes_shared - 1) * page_size);
+ if (!check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+
+ if (wait_for_scan("Do not collapse with max_ptes_shared exeeded", p))
+ fail("Timeout");
+ else if (!check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+
+ printf("Trigger CoW in %d of %d...",
+ hpage_pmd_nr - max_ptes_shared, hpage_pmd_nr);
+ fill_memory(p, 0, (hpage_pmd_nr - max_ptes_shared) * page_size);
+ if (!check_huge(p))
+ success("OK");
+ else
+ fail("Fail");
+
+
+ if (wait_for_scan("Collapse with max_ptes_shared PTEs shared", p))
+ fail("Timeout");
+ else if (check_huge(p))
+ success("OK");
+ else
+ fail("Fail");

validate_memory(p, 0, hpage_pmd_size);
munmap(p, hpage_pmd_size);
@@ -877,6 +958,7 @@ int main(void)

default_settings.khugepaged.max_ptes_none = hpage_pmd_nr - 1;
default_settings.khugepaged.max_ptes_swap = hpage_pmd_nr / 8;
+ default_settings.khugepaged.max_ptes_shared = hpage_pmd_nr / 2;
default_settings.khugepaged.pages_to_scan = hpage_pmd_nr * 8;

save_settings();
@@ -894,6 +976,7 @@ int main(void)
collapse_compound_extreme();
collapse_fork();
collapse_fork_compound();
+ collapse_max_ptes_shared();

restore_settings(0);
}
--
2.26.0

2020-04-05 23:43:31

by William Kucharski

[permalink] [raw]
Subject: Re: [PATCHv2 0/8] thp/khugepaged improvements and CoW semantics


For the PATCHv2 series:

Reviewed-by: William Kucharski <[email protected]>

2020-04-06 13:19:50

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCHv2 8/8] khugepaged: Introduce 'max_ptes_shared' tunable

On 3 Apr 2020, at 7:29, Kirill A. Shutemov wrote:

> External email: Use caution opening links or attachments
>
>
> ``max_ptes_shared`` speicies how many pages can be shared across multiple

s/speicies/specifies

> processes. Exeeding the number woul block the collapse::

s/Exeeding/Exceeding

s/woul/would

>
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared
>
> A higher value may increase memory footprint for some workloads.
>
> By default, at least half of pages has to be not shared.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> Documentation/admin-guide/mm/transhuge.rst | 7 ++
> include/trace/events/huge_memory.h | 3 +-
> mm/khugepaged.c | 52 ++++++++++++--
> tools/testing/selftests/vm/khugepaged.c | 83 ++++++++++++++++++++++
> 4 files changed, 140 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index bd5714547cee..d16e4f2bb70f 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -220,6 +220,13 @@ memory. A lower value can prevent THPs from being
> collapsed, resulting fewer pages being collapsed into
> THPs, and lower memory access performance.
>
> +``max_ptes_shared`` speicies how many pages can be shared across multiple
> +processes. Exeeding the number woul block the collapse::

s/speicies/specifies

s/Exeeding/Exceeding

s/woul/would
> +
> + /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared
> +
> +A higher value may increase memory footprint for some workloads.
> +
> Boot parameter
> ==============
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index d82a0f4e824d..53532f5925c3 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -12,6 +12,8 @@
> EM( SCAN_SUCCEED, "succeeded") \
> EM( SCAN_PMD_NULL, "pmd_null") \
> EM( SCAN_EXCEED_NONE_PTE, "exceed_none_pte") \
> + EM( SCAN_EXCEED_SWAP_PTE, "exceed_swap_pte") \
> + EM( SCAN_EXCEED_SHARED_PTE, "exceed_shared_pte") \
> EM( SCAN_PTE_NON_PRESENT, "pte_non_present") \
> EM( SCAN_PAGE_RO, "no_writable_page") \
> EM( SCAN_LACK_REFERENCED_PAGE, "lack_referenced_page") \
> @@ -30,7 +32,6 @@
> EM( SCAN_DEL_PAGE_LRU, "could_not_delete_page_from_lru")\
> EM( SCAN_ALLOC_HUGE_PAGE_FAIL, "alloc_huge_page_failed") \
> EM( SCAN_CGROUP_CHARGE_FAIL, "ccgroup_charge_failed") \
> - EM( SCAN_EXCEED_SWAP_PTE, "exceed_swap_pte") \
> EM( SCAN_TRUNCATED, "truncated") \
> EMe(SCAN_PAGE_HAS_PRIVATE, "page_has_private") \
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 49e56e4e30d1..bfb6155f1d69 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -28,6 +28,8 @@ enum scan_result {
> SCAN_SUCCEED,
> SCAN_PMD_NULL,
> SCAN_EXCEED_NONE_PTE,
> + SCAN_EXCEED_SWAP_PTE,
> + SCAN_EXCEED_SHARED_PTE,
> SCAN_PTE_NON_PRESENT,
> SCAN_PAGE_RO,
> SCAN_LACK_REFERENCED_PAGE,
> @@ -46,7 +48,6 @@ enum scan_result {
> SCAN_DEL_PAGE_LRU,
> SCAN_ALLOC_HUGE_PAGE_FAIL,
> SCAN_CGROUP_CHARGE_FAIL,
> - SCAN_EXCEED_SWAP_PTE,
> SCAN_TRUNCATED,
> SCAN_PAGE_HAS_PRIVATE,
> };
> @@ -71,6 +72,7 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
> */
> static unsigned int khugepaged_max_ptes_none __read_mostly;
> static unsigned int khugepaged_max_ptes_swap __read_mostly;
> +static unsigned int khugepaged_max_ptes_shared __read_mostly;
>
> #define MM_SLOTS_HASH_BITS 10
> static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> @@ -290,15 +292,43 @@ static struct kobj_attribute khugepaged_max_ptes_swap_attr =
> __ATTR(max_ptes_swap, 0644, khugepaged_max_ptes_swap_show,
> khugepaged_max_ptes_swap_store);
>
> +static ssize_t khugepaged_max_ptes_shared_show(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + char *buf)
> +{
> + return sprintf(buf, "%u\n", khugepaged_max_ptes_shared);
> +}
> +
> +static ssize_t khugepaged_max_ptes_shared_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + int err;
> + unsigned long max_ptes_shared;
> +
> + err = kstrtoul(buf, 10, &max_ptes_shared);
> + if (err || max_ptes_shared > HPAGE_PMD_NR-1)
> + return -EINVAL;
> +
> + khugepaged_max_ptes_shared = max_ptes_shared;
> +
> + return count;
> +}
> +
> +static struct kobj_attribute khugepaged_max_ptes_shared_attr =
> + __ATTR(max_ptes_shared, 0644, khugepaged_max_ptes_shared_show,
> + khugepaged_max_ptes_shared_store);
> +
> static struct attribute *khugepaged_attr[] = {
> &khugepaged_defrag_attr.attr,
> &khugepaged_max_ptes_none_attr.attr,
> + &khugepaged_max_ptes_swap_attr.attr,
> + &khugepaged_max_ptes_shared_attr.attr,
> &pages_to_scan_attr.attr,
> &pages_collapsed_attr.attr,
> &full_scans_attr.attr,
> &scan_sleep_millisecs_attr.attr,
> &alloc_sleep_millisecs_attr.attr,
> - &khugepaged_max_ptes_swap_attr.attr,
> NULL,
> };
>
> @@ -360,6 +390,7 @@ int __init khugepaged_init(void)
> khugepaged_pages_to_scan = HPAGE_PMD_NR * 8;
> khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;
> khugepaged_max_ptes_swap = HPAGE_PMD_NR / 8;
> + khugepaged_max_ptes_shared = HPAGE_PMD_NR / 2;
>
> return 0;
> }
> @@ -549,7 +580,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> {
> struct page *page = NULL;
> pte_t *_pte;
> - int none_or_zero = 0, result = 0, referenced = 0;
> + int none_or_zero = 0, shared = 0, result = 0, referenced = 0;
> bool writable = false;
>
> for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
> @@ -577,6 +608,12 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>
> VM_BUG_ON_PAGE(!PageAnon(page), page);
>
> + if (page_mapcount(page) > 1 &&
> + ++shared > khugepaged_max_ptes_shared) {
> + result = SCAN_EXCEED_SHARED_PTE;
> + goto out;
> + }
> +
> if (PageCompound(page)) {
> struct page *p;
> page = compound_head(page);
> @@ -1168,7 +1205,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> {
> pmd_t *pmd;
> pte_t *pte, *_pte;
> - int ret = 0, none_or_zero = 0, result = 0, referenced = 0;
> + int ret = 0, result = 0, referenced = 0;
> + int none_or_zero = 0, shared = 0;
> struct page *page = NULL;
> unsigned long _address;
> spinlock_t *ptl;
> @@ -1218,6 +1256,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> goto out_unmap;
> }
>
> + if (page_mapcount(page) > 1 &&
> + ++shared > khugepaged_max_ptes_shared) {
> + result = SCAN_EXCEED_SHARED_PTE;
> + goto out_unmap;
> + }
> +
> page = compound_head(page);
>
> /*
> diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
> index 3eeff4a0fbc9..9ae119234a39 100644
> --- a/tools/testing/selftests/vm/khugepaged.c
> +++ b/tools/testing/selftests/vm/khugepaged.c
> @@ -77,6 +77,7 @@ struct khugepaged_settings {
> unsigned int scan_sleep_millisecs;
> unsigned int max_ptes_none;
> unsigned int max_ptes_swap;
> + unsigned int max_ptes_shared;
> unsigned long pages_to_scan;
> };
>
> @@ -276,6 +277,7 @@ static void write_settings(struct settings *settings)
> khugepaged->scan_sleep_millisecs);
> write_num("khugepaged/max_ptes_none", khugepaged->max_ptes_none);
> write_num("khugepaged/max_ptes_swap", khugepaged->max_ptes_swap);
> + write_num("khugepaged/max_ptes_shared", khugepaged->max_ptes_shared);
> write_num("khugepaged/pages_to_scan", khugepaged->pages_to_scan);
> }
>
> @@ -312,6 +314,7 @@ static void save_settings(void)
> read_num("khugepaged/scan_sleep_millisecs"),
> .max_ptes_none = read_num("khugepaged/max_ptes_none"),
> .max_ptes_swap = read_num("khugepaged/max_ptes_swap"),
> + .max_ptes_shared = read_num("khugepaged/max_ptes_shared"),
> .pages_to_scan = read_num("khugepaged/pages_to_scan"),
> };
> success("OK");
> @@ -843,12 +846,90 @@ static void collapse_fork_compound(void)
> fail("Fail");
> fill_memory(p, 0, page_size);
>
> + write_num("khugepaged/max_ptes_shared", hpage_pmd_nr - 1);
> if (wait_for_scan("Collapse PTE table full of compound pages in child", p))
> fail("Timeout");
> else if (check_huge(p))
> success("OK");
> else
> fail("Fail");
> + write_num("khugepaged/max_ptes_shared",
> + default_settings.khugepaged.max_ptes_shared);
> +
> + validate_memory(p, 0, hpage_pmd_size);
> + munmap(p, hpage_pmd_size);
> + exit(exit_status);
> + }
> +
> + wait(&wstatus);
> + exit_status += WEXITSTATUS(wstatus);
> +
> + printf("Check if parent still has huge page...");
> + if (check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> + validate_memory(p, 0, hpage_pmd_size);
> + munmap(p, hpage_pmd_size);
> +}
> +
> +static void collapse_max_ptes_shared()
> +{
> + int max_ptes_shared = read_num("khugepaged/max_ptes_shared");
> + int wstatus;
> + void *p;
> +
> + p = alloc_mapping();
> +
> + printf("Allocate huge page...");
> + madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
> + fill_memory(p, 0, hpage_pmd_size);
> + if (check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> +
> + printf("Share huge page over fork()...");
> + if (!fork()) {
> + /* Do not touch settings on child exit */
> + skip_settings_restore = true;
> + exit_status = 0;
> +
> + if (check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> +
> + printf("Trigger CoW in %d of %d...",
> + hpage_pmd_nr - max_ptes_shared - 1, hpage_pmd_nr);
> + fill_memory(p, 0, (hpage_pmd_nr - max_ptes_shared - 1) * page_size);
> + if (!check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> +
> + if (wait_for_scan("Do not collapse with max_ptes_shared exeeded", p))
> + fail("Timeout");
> + else if (!check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> +
> + printf("Trigger CoW in %d of %d...",
> + hpage_pmd_nr - max_ptes_shared, hpage_pmd_nr);
> + fill_memory(p, 0, (hpage_pmd_nr - max_ptes_shared) * page_size);
> + if (!check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> +
> +
> + if (wait_for_scan("Collapse with max_ptes_shared PTEs shared", p))
> + fail("Timeout");
> + else if (check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
>
> validate_memory(p, 0, hpage_pmd_size);
> munmap(p, hpage_pmd_size);
> @@ -877,6 +958,7 @@ int main(void)
>
> default_settings.khugepaged.max_ptes_none = hpage_pmd_nr - 1;
> default_settings.khugepaged.max_ptes_swap = hpage_pmd_nr / 8;
> + default_settings.khugepaged.max_ptes_shared = hpage_pmd_nr / 2;
> default_settings.khugepaged.pages_to_scan = hpage_pmd_nr * 8;
>
> save_settings();
> @@ -894,6 +976,7 @@ int main(void)
> collapse_compound_extreme();
> collapse_fork();
> collapse_fork_compound();
> + collapse_max_ptes_shared();
>
> restore_settings(0);
> }
> --
> 2.26.0



Best Regards,
Yan Zi


Attachments:
signature.asc (871.00 B)
OpenPGP digital signature

2020-04-06 15:00:51

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCHv2 1/8] khugepaged: Add self test

On 3 Apr 2020, at 7:29, Kirill A. Shutemov wrote:

> External email: Use caution opening links or attachments
>
>
> The test checks if khugepaged is able to recover huge page where we
> expetect to do so. It only covers anon-THP for now.

s/expetect/expect

>
> Currenlty the test shows few failures. They are going to be addressed by

s/Currenlty/Currently

> the following patches.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> tools/testing/selftests/vm/Makefile | 1 +
> tools/testing/selftests/vm/khugepaged.c | 899 ++++++++++++++++++++++++
> 2 files changed, 900 insertions(+)
> create mode 100644 tools/testing/selftests/vm/khugepaged.c
>
> diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
> index 7f9a8a8c31da..981d0dc21f9e 100644
> --- a/tools/testing/selftests/vm/Makefile
> +++ b/tools/testing/selftests/vm/Makefile
> @@ -18,6 +18,7 @@ TEST_GEN_FILES += on-fault-limit
> TEST_GEN_FILES += thuge-gen
> TEST_GEN_FILES += transhuge-stress
> TEST_GEN_FILES += userfaultfd
> +TEST_GEN_FILES += khugepaged
>
> ifneq (,$(filter $(ARCH),arm64 ia64 mips64 parisc64 ppc64 riscv64 s390x sh64 sparc64 x86_64))
> TEST_GEN_FILES += va_128TBswitch
> diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c
> new file mode 100644
> index 000000000000..3eeff4a0fbc9
> --- /dev/null
> +++ b/tools/testing/selftests/vm/khugepaged.c
> @@ -0,0 +1,899 @@
> +#define _GNU_SOURCE
> +#include <fcntl.h>
> +#include <limits.h>
> +#include <signal.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <stdbool.h>
> +#include <string.h>
> +#include <unistd.h>
> +
> +#include <sys/mman.h>
> +#include <sys/wait.h>
> +
> +#ifndef MADV_PAGEOUT
> +#define MADV_PAGEOUT 21
> +#endif
> +
> +#define BASE_ADDR ((void *)(1UL << 30))
> +static unsigned long hpage_pmd_size;
> +static unsigned long page_size;
> +static int hpage_pmd_nr;
> +
> +#define THP_SYSFS "/sys/kernel/mm/transparent_hugepage/"
> +
> +enum thp_enabled {
> + THP_ALWAYS,
> + THP_MADVISE,
> + THP_NEVER,
> +};
> +
> +static const char *thp_enabled_strings[] = {
> + "always",
> + "madvise",
> + "never",
> + NULL
> +};
> +
> +enum thp_defrag {
> + THP_DEFRAG_ALWAYS,
> + THP_DEFRAG_DEFER,
> + THP_DEFRAG_DEFER_MADVISE,
> + THP_DEFRAG_MADVISE,
> + THP_DEFRAG_NEVER,
> +};
> +
> +static const char *thp_defrag_strings[] = {
> + "always",
> + "defer",
> + "defer+madvise",
> + "madvise",
> + "never",
> + NULL
> +};
> +
> +enum shmem_enabled {
> + SHMEM_ALWAYS,
> + SHMEM_WITHIN_SIZE,
> + SHMEM_ADVISE,
> + SHMEM_NEVER,
> + SHMEM_DENY,
> + SHMEM_FORCE,
> +};
> +
> +static const char *shmem_enabled_strings[] = {
> + "always",
> + "within_size",
> + "advise",
> + "never",
> + "deny",
> + "force",
> + NULL
> +};
> +
> +struct khugepaged_settings {
> + bool defrag;
> + unsigned int alloc_sleep_millisecs;
> + unsigned int scan_sleep_millisecs;
> + unsigned int max_ptes_none;
> + unsigned int max_ptes_swap;
> + unsigned long pages_to_scan;
> +};
> +
> +struct settings {
> + enum thp_enabled thp_enabled;
> + enum thp_defrag thp_defrag;
> + enum shmem_enabled shmem_enabled;
> + bool debug_cow;
> + bool use_zero_page;
> + struct khugepaged_settings khugepaged;
> +};
> +
> +static struct settings default_settings = {
> + .thp_enabled = THP_MADVISE,
> + .thp_defrag = THP_DEFRAG_ALWAYS,
> + .shmem_enabled = SHMEM_NEVER,
> + .debug_cow = 0,
> + .use_zero_page = 0,
> + .khugepaged = {
> + .defrag = 1,
> + .alloc_sleep_millisecs = 10,
> + .scan_sleep_millisecs = 10,
> + },
> +};
> +
> +static struct settings saved_settings;
> +static bool skip_settings_restore;
> +
> +static int exit_status;
> +
> +static void success(const char *msg)
> +{
> + printf(" \e[32m%s\e[0m\n", msg);
> +}
> +
> +static void fail(const char *msg)
> +{
> + printf(" \e[31m%s\e[0m\n", msg);
> + exit_status++;
> +}
> +
> +static int read_file(const char *path, char *buf, size_t buflen)
> +{
> + int fd;
> + ssize_t numread;
> +
> + fd = open(path, O_RDONLY);
> + if (fd == -1)
> + return 0;
> +
> + numread = read(fd, buf, buflen - 1);
> + if (numread < 1) {
> + close(fd);
> + return 0;
> + }
> +
> + buf[numread] = '\0';
> + close(fd);
> +
> + return (unsigned int) numread;
> +}
> +
> +static int write_file(const char *path, const char *buf, size_t buflen)
> +{
> + int fd;
> + ssize_t numwritten;
> +
> + fd = open(path, O_WRONLY);
> + if (fd == -1)
> + return 0;
> +
> + numwritten = write(fd, buf, buflen - 1);
> + close(fd);
> + if (numwritten < 1)
> + return 0;
> +
> + return (unsigned int) numwritten;
> +}
> +
> +static int read_string(const char *name, const char *strings[])
> +{
> + char path[PATH_MAX];
> + char buf[256];
> + char *c;
> + int ret;
> +
> + ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name);
> + if (ret >= PATH_MAX) {
> + printf("%s: Pathname is too long\n", __func__);
> + exit(EXIT_FAILURE);
> + }
> +
> + if (!read_file(path, buf, sizeof(buf))) {
> + perror(path);
> + exit(EXIT_FAILURE);
> + }
> +
> + c = strchr(buf, '[');
> + if (!c) {
> + printf("%s: Parse failure\n", __func__);
> + exit(EXIT_FAILURE);
> + }
> +
> + c++;
> + memmove(buf, c, sizeof(buf) - (c - buf));
> +
> + c = strchr(buf, ']');
> + if (!c) {
> + printf("%s: Parse failure\n", __func__);
> + exit(EXIT_FAILURE);
> + }
> + *c = '\0';
> +
> + ret = 0;
> + while (strings[ret]) {
> + if (!strcmp(strings[ret], buf))
> + return ret;
> + ret++;
> + }
> +
> + printf("Failed to parse %s\n", name);
> + exit(EXIT_FAILURE);
> +}
> +
> +static void write_string(const char *name, const char *val)
> +{
> + char path[PATH_MAX];
> + int ret;
> +
> + ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name);
> + if (ret >= PATH_MAX) {
> + printf("%s: Pathname is too long\n", __func__);
> + exit(EXIT_FAILURE);
> + }
> +
> + if (!write_file(path, val, strlen(val) + 1)) {
> + perror(path);
> + exit(EXIT_FAILURE);
> + }
> +}
> +
> +static const unsigned long read_num(const char *name)
> +{
> + char path[PATH_MAX];
> + char buf[21];
> + int ret;
> +
> + ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name);
> + if (ret >= PATH_MAX) {
> + printf("%s: Pathname is too long\n", __func__);
> + exit(EXIT_FAILURE);
> + }
> +
> + ret = read_file(path, buf, sizeof(buf));
> + if (ret < 0) {
> + perror("read_file(read_num)");
> + exit(EXIT_FAILURE);
> + }
> +
> + return strtoul(buf, NULL, 10);
> +}
> +
> +static void write_num(const char *name, unsigned long num)
> +{
> + char path[PATH_MAX];
> + char buf[21];
> + int ret;
> +
> + ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name);
> + if (ret >= PATH_MAX) {
> + printf("%s: Pathname is too long\n", __func__);
> + exit(EXIT_FAILURE);
> + }
> +
> + sprintf(buf, "%ld", num);
> + if (!write_file(path, buf, strlen(buf) + 1)) {
> + perror(path);
> + exit(EXIT_FAILURE);
> + }
> +}
> +
> +static void write_settings(struct settings *settings)
> +{
> + struct khugepaged_settings *khugepaged = &settings->khugepaged;
> +
> + write_string("enabled", thp_enabled_strings[settings->thp_enabled]);
> + write_string("defrag", thp_defrag_strings[settings->thp_defrag]);
> + write_string("shmem_enabled",
> + shmem_enabled_strings[settings->shmem_enabled]);
> + write_num("debug_cow", settings->debug_cow);
> + write_num("use_zero_page", settings->use_zero_page);
> +
> + write_num("khugepaged/defrag", khugepaged->defrag);
> + write_num("khugepaged/alloc_sleep_millisecs",
> + khugepaged->alloc_sleep_millisecs);
> + write_num("khugepaged/scan_sleep_millisecs",
> + khugepaged->scan_sleep_millisecs);
> + write_num("khugepaged/max_ptes_none", khugepaged->max_ptes_none);
> + write_num("khugepaged/max_ptes_swap", khugepaged->max_ptes_swap);
> + write_num("khugepaged/pages_to_scan", khugepaged->pages_to_scan);
> +}
> +
> +static void restore_settings(int sig)
> +{
> + if (skip_settings_restore)
> + goto out;
> +
> + printf("Restore THP and khugepaged settings...");
> + write_settings(&saved_settings);
> + success("OK");
> + if (sig)
> + exit(EXIT_FAILURE);
> +out:
> + exit(exit_status);
> +}
> +
> +static void save_settings(void)
> +{
> + printf("Save THP and khugepaged settings...");
> + saved_settings = (struct settings) {
> + .thp_enabled = read_string("enabled", thp_enabled_strings),
> + .thp_defrag = read_string("defrag", thp_defrag_strings),
> + .shmem_enabled =
> + read_string("shmem_enabled", shmem_enabled_strings),
> + .debug_cow = read_num("debug_cow"),
> + .use_zero_page = read_num("use_zero_page"),
> + };
> + saved_settings.khugepaged = (struct khugepaged_settings) {
> + .defrag = read_num("khugepaged/defrag"),
> + .alloc_sleep_millisecs =
> + read_num("khugepaged/alloc_sleep_millisecs"),
> + .scan_sleep_millisecs =
> + read_num("khugepaged/scan_sleep_millisecs"),
> + .max_ptes_none = read_num("khugepaged/max_ptes_none"),
> + .max_ptes_swap = read_num("khugepaged/max_ptes_swap"),
> + .pages_to_scan = read_num("khugepaged/pages_to_scan"),
> + };
> + success("OK");
> +
> + signal(SIGTERM, restore_settings);
> + signal(SIGINT, restore_settings);
> + signal(SIGHUP, restore_settings);
> + signal(SIGQUIT, restore_settings);
> +}
> +
> +static void adjust_settings(void)
> +{
> +
> + printf("Adjust settings...");
> + write_settings(&default_settings);
> + success("OK");
> +}
> +
> +#define CHECK_HUGE_FMT "sed -ne " \
> + "'/^%lx/,/^AnonHugePages/{/^AnonHugePages:\\s*%ld kB/ q1}' " \
> + "/proc/%d/smaps"
> +
> +static bool check_huge(void *p)
> +{
> + char *cmd;
> + int ret;
> +
> + ret = asprintf(&cmd, CHECK_HUGE_FMT,
> + (unsigned long)p, hpage_pmd_size >> 10, getpid());
> + if (ret < 0) {
> + perror("asprintf(CHECK_FMT)");
> + exit(EXIT_FAILURE);
> + }
> +
> + ret = system(cmd);
> + free(cmd);
> + if (ret < 0 || !WIFEXITED(ret)) {
> + perror("system(check_huge)");
> + exit(EXIT_FAILURE);
> + }
> +
> + return WEXITSTATUS(ret);
> +}
> +
> +#define CHECK_SWAP_FMT "sed -ne " \
> + "'/^%lx/,/^Swap:/{/^Swap:\\s*%ld kB/ q1}' " \
> + "/proc/%d/smaps"
> +
> +static bool check_swap(void *p, unsigned long size)
> +{
> + char *cmd;
> + int ret;
> +
> + ret = asprintf(&cmd, CHECK_SWAP_FMT,
> + (unsigned long)p, size >> 10, getpid());
> + if (ret < 0) {
> + perror("asprintf(CHECK_SWAP)");
> + exit(EXIT_FAILURE);
> + }
> +
> + ret = system(cmd);
> + free(cmd);
> + if (ret < 0 || !WIFEXITED(ret)) {
> + perror("system(check_swap)");
> + exit(EXIT_FAILURE);
> + }
> +
> + return WEXITSTATUS(ret);
> +}
> +
> +static void *alloc_mapping(void)
> +{
> + void *p;
> +
> + p = mmap(BASE_ADDR, hpage_pmd_size, PROT_READ | PROT_WRITE,
> + MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> + if (p != BASE_ADDR) {
> + printf("Failed to allocate VMA at %p\n", BASE_ADDR);
> + exit(EXIT_FAILURE);
> + }
> +
> + return p;
> +}
> +
> +static void fill_memory(int *p, unsigned long start, unsigned long end)
> +{
> + int i;
> +
> + for (i = start / page_size; i < end / page_size; i++)
> + p[i * page_size / sizeof(*p)] = i + 0xdead0000;
> +}
> +
> +static void validate_memory(int *p, unsigned long start, unsigned long end)
> +{
> + int i;
> +
> + for (i = start / page_size; i < end / page_size; i++) {
> + if (p[i * page_size / sizeof(*p)] != i + 0xdead0000) {
> + printf("Page %d is corrupted: %#x\n",
> + i, p[i * page_size / sizeof(*p)]);
> + exit(EXIT_FAILURE);
> + }
> + }
> +}
> +
> +#define TICK 500000
> +static bool wait_for_scan(const char *msg, char *p)
> +{
> + int full_scans;
> + int timeout = 6; /* 3 seconds */
> +
> + /* Sanity check */
> + if (check_huge(p)) {
> + printf("Unexpected huge page\n");
> + exit(EXIT_FAILURE);
> + }
> +
> + madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
> +
> + /* Wait until the second full_scan completed */
> + full_scans = read_num("khugepaged/full_scans") + 2;
> +
> + printf("%s...", msg);
> + while (timeout--) {
> + if (check_huge(p))
> + break;
> + if (read_num("khugepaged/full_scans") >= full_scans)
> + break;
> + printf(".");
> + usleep(TICK);
> + }
> +
> + madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
> +
> + return !timeout;
> +}
> +
> +static void alloc_at_fault(void)
> +{
> + struct settings settings = default_settings;
> + char *p;
> +
> + settings.thp_enabled = THP_ALWAYS;
> + write_settings(&settings);
> +
> + p = alloc_mapping();
> + *p = 1;
> + printf("Allocate huge page on fault...");
> + if (check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> +
> + write_settings(&default_settings);
> +
> + madvise(p, page_size, MADV_DONTNEED);
> + printf("Split huge PMD on MADV_DONTNEED...");
> + if (!check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> + munmap(p, hpage_pmd_size);
> +}
> +
> +static void collapse_full(void)
> +{
> + void *p;
> +
> + p = alloc_mapping();
> + fill_memory(p, 0, hpage_pmd_size);
> + if (wait_for_scan("Collapse fully populated PTE table", p))
> + fail("Timeout");
> + else if (check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> + validate_memory(p, 0, hpage_pmd_size);
> + munmap(p, hpage_pmd_size);
> +}
> +
> +static void collapse_empty(void)
> +{
> + void *p;
> +
> + p = alloc_mapping();
> + if (wait_for_scan("Do not collapse empty PTE table", p))
> + fail("Timeout");
> + else if (check_huge(p))
> + fail("Fail");
> + else
> + success("OK");
> + munmap(p, hpage_pmd_size);
> +}
> +
> +static void collapse_signle_pte_entry(void)
> +{
> + void *p;
> +
> + p = alloc_mapping();
> + fill_memory(p, 0, page_size);
> + if (wait_for_scan("Collapse PTE table with single PTE entry present", p))
> + fail("Timeout");
> + else if (check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> + validate_memory(p, 0, page_size);
> + munmap(p, hpage_pmd_size);
> +}
> +
> +static void collapse_max_ptes_none(void)
> +{
> + int max_ptes_none = hpage_pmd_nr / 2;
> + struct settings settings = default_settings;
> + void *p;
> +
> + settings.khugepaged.max_ptes_none = max_ptes_none;
> + write_settings(&settings);
> +
> + p = alloc_mapping();
> +
> + fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size);
> + if (wait_for_scan("Do not collapse with max_ptes_none exeeded", p))
> + fail("Timeout");
> + else if (check_huge(p))
> + fail("Fail");
> + else
> + success("OK");
> + validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size);
> +
> + fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
> + if (wait_for_scan("Collapse with max_ptes_none PTEs empty", p))
> + fail("Timeout");
> + else if (check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> + validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size);
> +
> + munmap(p, hpage_pmd_size);
> + write_settings(&default_settings);
> +}
> +
> +static void collapse_swapin_single_pte(void)
> +{
> + void *p;
> + p = alloc_mapping();
> + fill_memory(p, 0, hpage_pmd_size);
> +
> + printf("Swapout one page...");
> + if (madvise(p, page_size, MADV_PAGEOUT)) {
> + perror("madvise(MADV_PAGEOUT)");
> + exit(EXIT_FAILURE);
> + }
> + if (check_swap(p, page_size)) {
> + success("OK");
> + } else {
> + fail("Fail");
> + goto out;
> + }
> +
> + if (wait_for_scan("Collapse with swaping in single PTE entry", p))
> + fail("Timeout");
> + else if (check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> + validate_memory(p, 0, hpage_pmd_size);
> +out:
> + munmap(p, hpage_pmd_size);
> +}
> +
> +static void collapse_max_ptes_swap(void)
> +{
> + int max_ptes_swap = read_num("khugepaged/max_ptes_swap");
> + void *p;
> +
> + p = alloc_mapping();
> +
> + fill_memory(p, 0, hpage_pmd_size);
> + printf("Swapout %d of %d pages...", max_ptes_swap + 1, hpage_pmd_nr);
> + if (madvise(p, (max_ptes_swap + 1) * page_size, MADV_PAGEOUT)) {
> + perror("madvise(MADV_PAGEOUT)");
> + exit(EXIT_FAILURE);
> + }
> + if (check_swap(p, (max_ptes_swap + 1) * page_size)) {
> + success("OK");
> + } else {
> + fail("Fail");
> + goto out;
> + }
> +
> + if (wait_for_scan("Do not collapse with max_ptes_swap exeeded", p))
> + fail("Timeout");
> + else if (check_huge(p))
> + fail("Fail");
> + else
> + success("OK");
> + validate_memory(p, 0, hpage_pmd_size);
> +
> + fill_memory(p, 0, hpage_pmd_size);
> + printf("Swapout %d of %d pages...", max_ptes_swap, hpage_pmd_nr);
> + if (madvise(p, max_ptes_swap * page_size, MADV_PAGEOUT)) {
> + perror("madvise(MADV_PAGEOUT)");
> + exit(EXIT_FAILURE);
> + }
> + if (check_swap(p, max_ptes_swap * page_size)) {
> + success("OK");
> + } else {
> + fail("Fail");
> + goto out;
> + }
> +
> + if (wait_for_scan("Collapse with max_ptes_swap pages swapped out", p))
> + fail("Timeout");
> + else if (check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> + validate_memory(p, 0, hpage_pmd_size);
> +out:
> + munmap(p, hpage_pmd_size);
> +}
> +
> +static void collapse_signle_pte_entry_compound(void)
> +{
> + void *p;
> +
> + p = alloc_mapping();
> +
> + printf("Allocate huge page...");
> + madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
> + fill_memory(p, 0, hpage_pmd_size);
> + if (check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> + madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
> +
> + printf("Split huge page leaving single PTE mapping compount page...");
> + madvise(p + page_size, hpage_pmd_size - page_size, MADV_DONTNEED);
> + if (!check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> +
> + if (wait_for_scan("Collapse PTE table with single PTE mapping compount page", p))
> + fail("Timeout");
> + else if (check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> + validate_memory(p, 0, page_size);
> + munmap(p, hpage_pmd_size);
> +}
> +
> +static void collapse_full_of_compound(void)
> +{
> + void *p;
> +
> + p = alloc_mapping();
> +
> + printf("Allocate huge page...");
> + madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
> + fill_memory(p, 0, hpage_pmd_size);
> + if (check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> +
> + printf("Split huge page leaving single PTE page table full of compount pages...");
> + madvise(p, page_size, MADV_NOHUGEPAGE);
> + madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
> + if (!check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> +
> + if (wait_for_scan("Collapse PTE table full of compound pages", p))
> + fail("Timeout");
> + else if (check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> + validate_memory(p, 0, hpage_pmd_size);
> + munmap(p, hpage_pmd_size);
> +}
> +
> +static void collapse_compound_extreme(void)
> +{
> + void *p;
> + int i;
> +
> + p = alloc_mapping();
> + for (i = 0; i < hpage_pmd_nr; i++) {
> + printf("\rConstruct PTE page table full of different PTE-mapped compound pages %3d/%d...",
> + i + 1, hpage_pmd_nr);
> +
> + madvise(BASE_ADDR, hpage_pmd_size, MADV_HUGEPAGE);
> + fill_memory(BASE_ADDR, 0, hpage_pmd_size);
> + if (!check_huge(BASE_ADDR)) {
> + printf("Failed to allocate huge page\n");
> + exit(EXIT_FAILURE);
> + }
> + madvise(BASE_ADDR, hpage_pmd_size, MADV_NOHUGEPAGE);
> +
> + p = mremap(BASE_ADDR - i * page_size,
> + i * page_size + hpage_pmd_size,
> + (i + 1) * page_size,
> + MREMAP_MAYMOVE | MREMAP_FIXED,
> + BASE_ADDR + 2 * hpage_pmd_size);
> + if (p == MAP_FAILED) {
> + perror("mremap+unmap");
> + exit(EXIT_FAILURE);
> + }
> +
> + p = mremap(BASE_ADDR + 2 * hpage_pmd_size,
> + (i + 1) * page_size,
> + (i + 1) * page_size + hpage_pmd_size,
> + MREMAP_MAYMOVE | MREMAP_FIXED,
> + BASE_ADDR - (i + 1) * page_size);
> + if (p == MAP_FAILED) {
> + perror("mremap+alloc");
> + exit(EXIT_FAILURE);
> + }
> + }
> +
> + munmap(BASE_ADDR, hpage_pmd_size);
> + fill_memory(p, 0, hpage_pmd_size);
> + if (!check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> +
> + if (wait_for_scan("Collapse PTE table full of different compound pages", p))
> + fail("Timeout");
> + else if (check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> +
> + validate_memory(p, 0, hpage_pmd_size);
> + munmap(p, hpage_pmd_size);
> +}
> +
> +static void collapse_fork(void)
> +{
> + int wstatus;
> + void *p;
> +
> + p = alloc_mapping();
> +
> + printf("Allocate small page...");
> + fill_memory(p, 0, page_size);
> + if (!check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> +
> + printf("Share small page over fork()...");
> + if (!fork()) {
> + /* Do not touch settings on child exit */
> + skip_settings_restore = true;
> + exit_status = 0;
> +
> + if (!check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> +
> + fill_memory(p, page_size, 2 * page_size);
> +
> + if (wait_for_scan("Collapse PTE table with single page shared with parent process", p))
> + fail("Timeout");
> + else if (check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> +
> + validate_memory(p, 0, page_size);
> + munmap(p, hpage_pmd_size);
> + exit(exit_status);
> + }
> +
> + wait(&wstatus);
> + exit_status += WEXITSTATUS(wstatus);
> +
> + printf("Check if parent still has small page...");
> + if (!check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> + validate_memory(p, 0, page_size);
> + munmap(p, hpage_pmd_size);
> +}
> +
> +static void collapse_fork_compound(void)
> +{
> + int wstatus;
> + void *p;
> +
> + p = alloc_mapping();
> +
> + printf("Allocate huge page...");
> + madvise(p, hpage_pmd_size, MADV_HUGEPAGE);
> + fill_memory(p, 0, hpage_pmd_size);
> + if (check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> +
> + printf("Share huge page over fork()...");
> + if (!fork()) {
> + /* Do not touch settings on child exit */
> + skip_settings_restore = true;
> + exit_status = 0;
> +
> + if (check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> +
> + printf("Split huge page PMD in child process...");
> + madvise(p, page_size, MADV_NOHUGEPAGE);
> + madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE);
> + if (!check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> + fill_memory(p, 0, page_size);
> +
> + if (wait_for_scan("Collapse PTE table full of compound pages in child", p))
> + fail("Timeout");
> + else if (check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> +
> + validate_memory(p, 0, hpage_pmd_size);
> + munmap(p, hpage_pmd_size);
> + exit(exit_status);
> + }
> +
> + wait(&wstatus);
> + exit_status += WEXITSTATUS(wstatus);
> +
> + printf("Check if parent still has huge page...");
> + if (check_huge(p))
> + success("OK");
> + else
> + fail("Fail");
> + validate_memory(p, 0, hpage_pmd_size);
> + munmap(p, hpage_pmd_size);
> +}
> +
> +int main(void)
> +{
> + setbuf(stdout, NULL);
> +
> + page_size = getpagesize();
> + hpage_pmd_size = read_num("hpage_pmd_size");
> + hpage_pmd_nr = hpage_pmd_size / page_size;
> +
> + default_settings.khugepaged.max_ptes_none = hpage_pmd_nr - 1;
> + default_settings.khugepaged.max_ptes_swap = hpage_pmd_nr / 8;
> + default_settings.khugepaged.pages_to_scan = hpage_pmd_nr * 8;
> +
> + save_settings();
> + adjust_settings();
> +
> + alloc_at_fault();
> + collapse_full();
> + collapse_empty();
> + collapse_signle_pte_entry();

s/signle/single

> + collapse_max_ptes_none();
> + collapse_swapin_single_pte();
> + collapse_max_ptes_swap();
> + collapse_signle_pte_entry_compound();

s/signle/single

> + collapse_full_of_compound();
> + collapse_compound_extreme();
> + collapse_fork();
> + collapse_fork_compound();
> +
> + restore_settings(0);
> +}
> --
> 2.26.0

I ran this test with all patches from this series applied to Linus’s tree, but still see several failures. Is it expected?
The config file is attached. Let me know if I miss anything. BTW, I am running in a VM.

Thanks.

The output:

➜ ~ sudo ./khugepaged
Save THP and khugepaged settings... OK
Adjust settings... OK
Allocate huge page on fault... OK
Split huge PMD on MADV_DONTNEED... OK
Collapse fully populated PTE table.... Fail
Do not collapse empty PTE table.... OK
Collapse PTE table with single PTE entry present.... Fail
Do not collapse with max_ptes_none exeeded.... OK
Collapse with max_ptes_none PTEs empty.... Fail
Swapout one page... OK
Collapse with swaping in single PTE entry.... Fail
Swapout 65 of 512 pages... OK
Do not collapse with max_ptes_swap exeeded.... OK
Swapout 64 of 512 pages... OK
Collapse with max_ptes_swap pages swapped out.... Fail
Allocate huge page... OK
Split huge page leaving single PTE mapping compount page... OK
Collapse PTE table with single PTE mapping compount page.... Fail
Allocate huge page... OK
Split huge page leaving single PTE page table full of compount pages... OK
Collapse PTE table full of compound pages.... Fail
Construct PTE page table full of different PTE-mapped compound pages 512/512... OK
Collapse PTE table full of different compound pages.... Fail
Allocate small page... OK
Share small page over fork()... OK
Collapse PTE table with single page shared with parent process.... Fail
Check if parent still has small page... OK
Allocate huge page... OK
Share huge page over fork()... OK
Split huge page PMD in child process... OK
Collapse PTE table full of compound pages in child.... Fail
Check if parent still has huge page... OK
Allocate huge page... OK
Share huge page over fork()... OK
Trigger CoW in 255 of 512... OK
Do not collapse with max_ptes_shared exeeded.... OK
Trigger CoW in 256 of 512... OK
Collapse with max_ptes_shared PTEs shared.... Fail
Check if parent still has huge page... OK
Restore THP and khugepaged settings... OK




Best Regards,
Yan Zi


Attachments:
.config (130.07 kB)
signature.asc (871.00 B)
OpenPGP digital signature
Download all attachments

2020-04-06 15:21:27

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2 1/8] khugepaged: Add self test

On Mon, Apr 06, 2020 at 10:59:52AM -0400, Zi Yan wrote:
> I ran this test with all patches from this series applied to Linus’s
> tree, but still see several failures. Is it expected?

[Thanks for catching all my typos]

No. It works fine for me.
Well, occasionally, it fails to swap out a page, but nothing like you have
here.

Could you try to trace it? I used script like below after mounting
tracefs. You may want to comment out all tests but a failing one.

#!/bin/sh -efu

echo > /sys/kernel/tracing/trace
echo 1 > /sys/kernel/tracing/events/huge_memory/enable

while ./khugepaged; do
echo > /sys/kernel/tracing/trace
done

cat /sys/kernel/tracing/trace

--
Kirill A. Shutemov

2020-04-06 18:16:14

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCHv2 2/8] khugepaged: Do not stop collapse if less than half PTEs are referenced



On 4/3/20 4:29 AM, Kirill A. Shutemov wrote:
> __collapse_huge_page_swapin() check number of referenced PTE to decide
> if the memory range is hot enough to justify swapin.
>
> The problem is that it stops collapse altogether if there's not enough
> referenced pages, not only swappingin.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Fixes: 0db501f7a34c ("mm, thp: convert from optimistic swapin collapsing to conservative")
> Reviewed-by: Zi Yan <[email protected]>
> ---
> mm/khugepaged.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)

Acked-by: Yang Shi <[email protected]>

>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 99bab7e4d05b..14d7afc90786 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -905,7 +905,8 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
> /* we only decide to swapin, if there is enough young ptes */
> if (referenced < HPAGE_PMD_NR/2) {
> trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> - return false;
> + /* Do not block collapse, only skip swapping in */
> + return true;
> }
> vmf.pte = pte_offset_map(pmd, address);
> for (; vmf.address < address + HPAGE_PMD_NR*PAGE_SIZE;

2020-04-06 18:54:47

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCHv2 1/8] khugepaged: Add self test

On 6 Apr 2020, at 11:20, Kirill A. Shutemov wrote:

> External email: Use caution opening links or attachments
>
>
> On Mon, Apr 06, 2020 at 10:59:52AM -0400, Zi Yan wrote:
>> I ran this test with all patches from this series applied to Linus’s
>> tree, but still see several failures. Is it expected?
>
> [Thanks for catching all my typos]
>
> No. It works fine for me.
> Well, occasionally, it fails to swap out a page, but nothing like you have
> here.
>
> Could you try to trace it? I used script like below after mounting
> tracefs. You may want to comment out all tests but a failing one.


Sure.

>
> #!/bin/sh -efu
>
> echo > /sys/kernel/tracing/trace
> echo 1 > /sys/kernel/tracing/events/huge_memory/enable
>
> while ./khugepaged; do
> echo > /sys/kernel/tracing/trace
> done
>
> cat /sys/kernel/tracing/trace

The test output is (should be useful to tell you the failed tests):

root@nvrsysarch-yz:/home/yanzi# ./trace.sh
Save THP and khugepaged settings... OK
Adjust settings... OK
Collapse fully populated PTE table.... Fail
Collapse PTE table with single PTE entry present.... Fail
Collapse with max_ptes_none PTEs empty.... Fail
Swapout one page... OK
Collapse with swaping in single PTE entry.... Fail
Swapout 64 of 512 pages... OK
Collapse with max_ptes_swap pages swapped out.... Fail
Allocate huge page... OK
Split huge page leaving single PTE mapping compount page... OK
Collapse PTE table with single PTE mapping compount page.... Fail
Allocate huge page... OK
Split huge page leaving single PTE page table full of compount pages... OK
Collapse PTE table full of compound pages.... Fail
Construct PTE page table full of different PTE-mapped compound pages 512/512... OK
Collapse PTE table full of different compound pages.... Fail
Allocate small page... OK
Share small page over fork()... OK
Collapse PTE table with single page shared with parent process.... Fail
Check if parent still has small page... OK
Allocate huge page... OK
Share huge page over fork()... OK
Split huge page PMD in child process... OK
Collapse PTE table full of compound pages in child.... Fail
Check if parent still has huge page... OK
Allocate huge page... OK
Share huge page over fork()... OK
Trigger CoW in 255 of 512... OK
Do not collapse with max_ptes_shared exeeded.... OK
Trigger CoW in 256 of 512... OK
Collapse with max_ptes_shared PTEs shared.... Fail
Check if parent still has huge page... OK
Restore THP and khugepaged settings... OK


The trace is attached.



Best Regards,
Yan Zi


Attachments:
khugepaged.trace (179.23 kB)
signature.asc (871.00 B)
OpenPGP digital signature
Download all attachments

2020-04-06 18:55:58

by Ralph Campbell

[permalink] [raw]
Subject: Re: [PATCHv2 1/8] khugepaged: Add self test


On 4/3/20 4:29 AM, Kirill A. Shutemov wrote:
> The test checks if khugepaged is able to recover huge page where we
> expetect to do so. It only covers anon-THP for now.
>
> Currenlty the test shows few failures. They are going to be addressed by
> the following patches.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>

One minor nit below but otherwise looks OK.
Reviewed-by: Ralph Campbell <[email protected]>


> +
> +static void collapse_signle_pte_entry(void)
> +{

Shouldn't this be collapse_single_pte_entry()?
(and change the places where it is called)

2020-04-06 19:53:56

by Ralph Campbell

[permalink] [raw]
Subject: Re: [PATCHv2 2/8] khugepaged: Do not stop collapse if less than half PTEs are referenced


On 4/3/20 4:29 AM, Kirill A. Shutemov wrote:
> __collapse_huge_page_swapin() check number of referenced PTE to decide
> if the memory range is hot enough to justify swapin.
>
> The problem is that it stops collapse altogether if there's not enough
> referenced pages, not only swappingin.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Fixes: 0db501f7a34c ("mm, thp: convert from optimistic swapin collapsing to conservative")
> Reviewed-by: Zi Yan <[email protected]> ---
> mm/khugepaged.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 99bab7e4d05b..14d7afc90786 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -905,7 +905,8 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
> /* we only decide to swapin, if there is enough young ptes */
> if (referenced < HPAGE_PMD_NR/2) {
> trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);

The trace point is recording the return value. Shouldn't you s/0/1 to match?

> - return false;
> + /* Do not block collapse, only skip swapping in */
> + return true;
> }
> vmf.pte = pte_offset_map(pmd, address);
> for (; vmf.address < address + HPAGE_PMD_NR*PAGE_SIZE;
>

So "referenced < HPAGE_PMD_NR/2" means swapped out pages aren't faulted back in
but there could still be no swapped out pages, just "old" pages so collapse could
succeed. Seems like this check could have been made in khugepaged_scan_pmd() when
"referenced" is being computed and "unmapped" is available. Just skip setting
"ret = 1" if unmapped > 0 && referenced < HPAGE_PMD_NR/2.

2020-04-06 21:30:57

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCHv2 6/8] khugepaged: Allow to collapse PTE-mapped compound pages



On 4/3/20 4:29 AM, Kirill A. Shutemov wrote:
> We can collapse PTE-mapped compound pages. We only need to avoid
> handling them more than once: lock/unlock page only once if it's present
> in the PMD range multiple times as it handled on compound level. The
> same goes for LRU isolation and putback.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> mm/khugepaged.c | 103 ++++++++++++++++++++++++++++++++----------------
> 1 file changed, 68 insertions(+), 35 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 1e7e6543ebca..49e56e4e30d1 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -515,23 +515,37 @@ void __khugepaged_exit(struct mm_struct *mm)
>
> static void release_pte_page(struct page *page)
> {
> - dec_node_page_state(page, NR_ISOLATED_ANON + page_is_file_cache(page));
> + mod_node_page_state(page_pgdat(page),
> + NR_ISOLATED_ANON + page_is_file_cache(page),
> + -compound_nr(page));
> unlock_page(page);
> putback_lru_page(page);
> }
>
> -static void release_pte_pages(pte_t *pte, pte_t *_pte)
> +static void release_pte_pages(pte_t *pte, pte_t *_pte,
> + struct list_head *compound_pagelist)
> {
> + struct page *page, *tmp;
> +
> while (--_pte >= pte) {
> pte_t pteval = *_pte;
> - if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval)))
> - release_pte_page(pte_page(pteval));
> +
> + page = pte_page(pteval);
> + if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval)) &&
> + !PageCompound(page))
> + release_pte_page(page);
> + }
> +
> + list_for_each_entry_safe(page, tmp, compound_pagelist, lru) {
> + list_del(&page->lru);
> + release_pte_page(page);
> }
> }
>
> static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> unsigned long address,
> - pte_t *pte)
> + pte_t *pte,
> + struct list_head *compound_pagelist)
> {
> struct page *page = NULL;
> pte_t *_pte;
> @@ -561,13 +575,21 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> goto out;
> }
>
> - /* TODO: teach khugepaged to collapse THP mapped with pte */
> + VM_BUG_ON_PAGE(!PageAnon(page), page);
> +
> if (PageCompound(page)) {
> - result = SCAN_PAGE_COMPOUND;
> - goto out;
> - }
> + struct page *p;
> + page = compound_head(page);
>
> - VM_BUG_ON_PAGE(!PageAnon(page), page);
> + /*
> + * Check if we have dealt with the compound page
> + * already
> + */
> + list_for_each_entry(p, compound_pagelist, lru) {
> + if (page == p)
> + goto next;
> + }
> + }
>
> /*
> * We can do it before isolate_lru_page because the
> @@ -597,19 +619,15 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> result = SCAN_PAGE_COUNT;
> goto out;
> }
> - if (pte_write(pteval)) {
> - writable = true;
> - } else {
> - if (PageSwapCache(page) &&
> - !reuse_swap_page(page, NULL)) {
> - unlock_page(page);
> - result = SCAN_SWAP_CACHE_PAGE;
> - goto out;
> - }
> + if (!pte_write(pteval) && PageSwapCache(page) &&
> + !reuse_swap_page(page, NULL)) {
> /*
> - * Page is not in the swap cache. It can be collapsed
> - * into a THP.
> + * Page is in the swap cache and cannot be re-used.
> + * It cannot be collapsed into a THP.
> */
> + unlock_page(page);
> + result = SCAN_SWAP_CACHE_PAGE;
> + goto out;
> }
>
> /*
> @@ -621,16 +639,23 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> result = SCAN_DEL_PAGE_LRU;
> goto out;
> }
> - inc_node_page_state(page,
> - NR_ISOLATED_ANON + page_is_file_cache(page));
> + mod_node_page_state(page_pgdat(page),
> + NR_ISOLATED_ANON + page_is_file_cache(page),
> + compound_nr(page));
> VM_BUG_ON_PAGE(!PageLocked(page), page);
> VM_BUG_ON_PAGE(PageLRU(page), page);
>
> + if (PageCompound(page))
> + list_add_tail(&page->lru, compound_pagelist);
> +next:
> /* There should be enough young pte to collapse the page */
> if (pte_young(pteval) ||
> page_is_young(page) || PageReferenced(page) ||
> mmu_notifier_test_young(vma->vm_mm, address))
> referenced++;
> +
> + if (pte_write(pteval))
> + writable = true;
> }
> if (likely(writable)) {
> if (likely(referenced)) {
> @@ -644,7 +669,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> }
>
> out:
> - release_pte_pages(pte, _pte);
> + release_pte_pages(pte, _pte, compound_pagelist);
> trace_mm_collapse_huge_page_isolate(page, none_or_zero,
> referenced, writable, result);
> return 0;
> @@ -653,13 +678,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> struct vm_area_struct *vma,
> unsigned long address,
> - spinlock_t *ptl)
> + spinlock_t *ptl,
> + struct list_head *compound_pagelist)
> {
> + struct page *src_page, *tmp;
> pte_t *_pte;
> for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> _pte++, page++, address += PAGE_SIZE) {
> pte_t pteval = *_pte;
> - struct page *src_page;
>
> if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> clear_user_highpage(page, address);
> @@ -679,7 +705,6 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> } else {
> src_page = pte_page(pteval);
> copy_user_highpage(page, src_page, address, vma);
> - release_pte_page(src_page);
> /*
> * ptl mostly unnecessary, but preempt has to
> * be disabled to update the per-cpu stats
> @@ -693,9 +718,18 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> pte_clear(vma->vm_mm, address, _pte);
> page_remove_rmap(src_page, false);
> spin_unlock(ptl);
> - free_page_and_swap_cache(src_page);
> + if (!PageCompound(src_page)) {
> + release_pte_page(src_page);
> + free_page_and_swap_cache(src_page);
> + }
> }
> }
> +
> + list_for_each_entry_safe(src_page, tmp, compound_pagelist, lru) {
> + list_del(&src_page->lru);
> + release_pte_page(src_page);
> + free_page_and_swap_cache(src_page);
> + }

It looks this may mess up the PTE-mapped THP's refcount if it has
multiple PTE-mapped subpages since put_page() is not called for every
PTE-mapped subpages.

> }
>
> static void khugepaged_alloc_sleep(void)
> @@ -960,6 +994,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> struct page **hpage,
> int node, int referenced)
> {
> + LIST_HEAD(compound_pagelist);
> pmd_t *pmd, _pmd;
> pte_t *pte;
> pgtable_t pgtable;
> @@ -1059,7 +1094,8 @@ static void collapse_huge_page(struct mm_struct *mm,
> mmu_notifier_invalidate_range_end(&range);
>
> spin_lock(pte_ptl);
> - isolated = __collapse_huge_page_isolate(vma, address, pte);
> + isolated = __collapse_huge_page_isolate(vma, address, pte,
> + &compound_pagelist);
> spin_unlock(pte_ptl);
>
> if (unlikely(!isolated)) {
> @@ -1084,7 +1120,8 @@ static void collapse_huge_page(struct mm_struct *mm,
> */
> anon_vma_unlock_write(vma->anon_vma);
>
> - __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl);
> + __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl,
> + &compound_pagelist);
> pte_unmap(pte);
> __SetPageUptodate(new_page);
> pgtable = pmd_pgtable(_pmd);
> @@ -1181,11 +1218,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> goto out_unmap;
> }
>
> - /* TODO: teach khugepaged to collapse THP mapped with pte */
> - if (PageCompound(page)) {
> - result = SCAN_PAGE_COMPOUND;
> - goto out_unmap;
> - }
> + page = compound_head(page);
>
> /*
> * Record which node the original page is from and save this

2020-04-08 13:53:19

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2 6/8] khugepaged: Allow to collapse PTE-mapped compound pages

On Mon, Apr 06, 2020 at 02:29:35PM -0700, Yang Shi wrote:
>
>
> On 4/3/20 4:29 AM, Kirill A. Shutemov wrote:
> > We can collapse PTE-mapped compound pages. We only need to avoid
> > handling them more than once: lock/unlock page only once if it's present
> > in the PMD range multiple times as it handled on compound level. The
> > same goes for LRU isolation and putback.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > mm/khugepaged.c | 103 ++++++++++++++++++++++++++++++++----------------
> > 1 file changed, 68 insertions(+), 35 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 1e7e6543ebca..49e56e4e30d1 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -515,23 +515,37 @@ void __khugepaged_exit(struct mm_struct *mm)
> > static void release_pte_page(struct page *page)
> > {
> > - dec_node_page_state(page, NR_ISOLATED_ANON + page_is_file_cache(page));
> > + mod_node_page_state(page_pgdat(page),
> > + NR_ISOLATED_ANON + page_is_file_cache(page),
> > + -compound_nr(page));
> > unlock_page(page);
> > putback_lru_page(page);
> > }
> > -static void release_pte_pages(pte_t *pte, pte_t *_pte)
> > +static void release_pte_pages(pte_t *pte, pte_t *_pte,
> > + struct list_head *compound_pagelist)
> > {
> > + struct page *page, *tmp;
> > +
> > while (--_pte >= pte) {
> > pte_t pteval = *_pte;
> > - if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval)))
> > - release_pte_page(pte_page(pteval));
> > +
> > + page = pte_page(pteval);
> > + if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval)) &&
> > + !PageCompound(page))
> > + release_pte_page(page);
> > + }
> > +
> > + list_for_each_entry_safe(page, tmp, compound_pagelist, lru) {
> > + list_del(&page->lru);
> > + release_pte_page(page);
> > }
> > }
> > static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > unsigned long address,
> > - pte_t *pte)
> > + pte_t *pte,
> > + struct list_head *compound_pagelist)
> > {
> > struct page *page = NULL;
> > pte_t *_pte;
> > @@ -561,13 +575,21 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > goto out;
> > }
> > - /* TODO: teach khugepaged to collapse THP mapped with pte */
> > + VM_BUG_ON_PAGE(!PageAnon(page), page);
> > +
> > if (PageCompound(page)) {
> > - result = SCAN_PAGE_COMPOUND;
> > - goto out;
> > - }
> > + struct page *p;
> > + page = compound_head(page);
> > - VM_BUG_ON_PAGE(!PageAnon(page), page);
> > + /*
> > + * Check if we have dealt with the compound page
> > + * already
> > + */
> > + list_for_each_entry(p, compound_pagelist, lru) {
> > + if (page == p)
> > + goto next;
> > + }
> > + }
> > /*
> > * We can do it before isolate_lru_page because the
> > @@ -597,19 +619,15 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > result = SCAN_PAGE_COUNT;
> > goto out;
> > }
> > - if (pte_write(pteval)) {
> > - writable = true;
> > - } else {
> > - if (PageSwapCache(page) &&
> > - !reuse_swap_page(page, NULL)) {
> > - unlock_page(page);
> > - result = SCAN_SWAP_CACHE_PAGE;
> > - goto out;
> > - }
> > + if (!pte_write(pteval) && PageSwapCache(page) &&
> > + !reuse_swap_page(page, NULL)) {
> > /*
> > - * Page is not in the swap cache. It can be collapsed
> > - * into a THP.
> > + * Page is in the swap cache and cannot be re-used.
> > + * It cannot be collapsed into a THP.
> > */
> > + unlock_page(page);
> > + result = SCAN_SWAP_CACHE_PAGE;
> > + goto out;
> > }
> > /*
> > @@ -621,16 +639,23 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > result = SCAN_DEL_PAGE_LRU;
> > goto out;
> > }
> > - inc_node_page_state(page,
> > - NR_ISOLATED_ANON + page_is_file_cache(page));
> > + mod_node_page_state(page_pgdat(page),
> > + NR_ISOLATED_ANON + page_is_file_cache(page),
> > + compound_nr(page));
> > VM_BUG_ON_PAGE(!PageLocked(page), page);
> > VM_BUG_ON_PAGE(PageLRU(page), page);
> > + if (PageCompound(page))
> > + list_add_tail(&page->lru, compound_pagelist);
> > +next:
> > /* There should be enough young pte to collapse the page */
> > if (pte_young(pteval) ||
> > page_is_young(page) || PageReferenced(page) ||
> > mmu_notifier_test_young(vma->vm_mm, address))
> > referenced++;
> > +
> > + if (pte_write(pteval))
> > + writable = true;
> > }
> > if (likely(writable)) {
> > if (likely(referenced)) {
> > @@ -644,7 +669,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > }
> > out:
> > - release_pte_pages(pte, _pte);
> > + release_pte_pages(pte, _pte, compound_pagelist);
> > trace_mm_collapse_huge_page_isolate(page, none_or_zero,
> > referenced, writable, result);
> > return 0;
> > @@ -653,13 +678,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> > struct vm_area_struct *vma,
> > unsigned long address,
> > - spinlock_t *ptl)
> > + spinlock_t *ptl,
> > + struct list_head *compound_pagelist)
> > {
> > + struct page *src_page, *tmp;
> > pte_t *_pte;
> > for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> > _pte++, page++, address += PAGE_SIZE) {
> > pte_t pteval = *_pte;
> > - struct page *src_page;
> > if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > clear_user_highpage(page, address);
> > @@ -679,7 +705,6 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> > } else {
> > src_page = pte_page(pteval);
> > copy_user_highpage(page, src_page, address, vma);
> > - release_pte_page(src_page);
> > /*
> > * ptl mostly unnecessary, but preempt has to
> > * be disabled to update the per-cpu stats
> > @@ -693,9 +718,18 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> > pte_clear(vma->vm_mm, address, _pte);
> > page_remove_rmap(src_page, false);
> > spin_unlock(ptl);
> > - free_page_and_swap_cache(src_page);
> > + if (!PageCompound(src_page)) {
> > + release_pte_page(src_page);
> > + free_page_and_swap_cache(src_page);
> > + }
> > }
> > }
> > +
> > + list_for_each_entry_safe(src_page, tmp, compound_pagelist, lru) {
> > + list_del(&src_page->lru);
> > + release_pte_page(src_page);
> > + free_page_and_swap_cache(src_page);
> > + }
>
> It looks this may mess up the PTE-mapped THP's refcount if it has multiple
> PTE-mapped subpages since put_page() is not called for every PTE-mapped
> subpages.

Good catch!

I *think* something like this should fix the issue (untested):

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index bfb6155f1d69..9a96f9bff798 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -570,6 +570,7 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
list_for_each_entry_safe(page, tmp, compound_pagelist, lru) {
list_del(&page->lru);
release_pte_page(page);
+ put_page(page);
}
}

@@ -682,8 +683,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageLRU(page), page);

- if (PageCompound(page))
+ if (PageCompound(page)) {
list_add_tail(&page->lru, compound_pagelist);
+ get_page(page);
+ }
next:
/* There should be enough young pte to collapse the page */
if (pte_young(pteval) ||
@@ -742,6 +745,8 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
} else {
src_page = pte_page(pteval);
copy_user_highpage(page, src_page, address, vma);
+ if (!PageCompound(src_page))
+ release_pte_page(src_page);
/*
* ptl mostly unnecessary, but preempt has to
* be disabled to update the per-cpu stats
@@ -755,10 +760,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
pte_clear(vma->vm_mm, address, _pte);
page_remove_rmap(src_page, false);
spin_unlock(ptl);
- if (!PageCompound(src_page)) {
- release_pte_page(src_page);
- free_page_and_swap_cache(src_page);
- }
+ free_page_and_swap_cache(src_page);
}
}

--
Kirill A. Shutemov

2020-04-08 15:24:36

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2 1/8] khugepaged: Add self test

On Mon, Apr 06, 2020 at 02:50:32PM -0400, Zi Yan wrote:
> khugepaged-58 [001] .... 9913.990380: mm_khugepaged_scan_pmd: mm=00000000283d31fc, scan_pfn=0x2ae4bd, writable=0, referenced=512, none_or_zero=0, status=no_writable_page, unmapped=0

Looks like all failures due to lack of writable ptes. That's very strange
because we write to the page on fill.

You've mentioned that you run it in VM. I wounder if it can be a
virtualizaiton artefact. I run tests under KVM and they are fine. What is
your virtualization setup?

--
Kirill A. Shutemov

2020-04-08 15:55:23

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCHv2 1/8] khugepaged: Add self test

On 8 Apr 2020, at 10:21, Kirill A. Shutemov wrote:

> External email: Use caution opening links or attachments
>
>
> On Mon, Apr 06, 2020 at 02:50:32PM -0400, Zi Yan wrote:
>> khugepaged-58 [001] .... 9913.990380: mm_khugepaged_scan_pmd: mm=00000000283d31fc, scan_pfn=0x2ae4bd, writable=0, referenced=512, none_or_zero=0, status=no_writable_page, unmapped=0
>
> Looks like all failures due to lack of writable ptes. That's very strange
> because we write to the page on fill.
>
> You've mentioned that you run it in VM. I wounder if it can be a
> virtualizaiton artefact. I run tests under KVM and they are fine. What is
> your virtualization setup?

My qemu cmd: “qemu-system-x86_64 -kernel ~/repos/linux/arch/x86/boot/bzImage -hda ~/qemu-image/vm.qcow2 -append "root=/dev/sda1 rw console=ttyS0" -pidfile vm.pid -net user,hostfwd=tcp::11022-:22 -net nic -numa node,nodeid=0 -numa node,nodeid=1 -m 20g -smp 8 -cpu host -enable-kvm -nographic”

QEMU version is 4.2.0 (Debian 1:4.2-3)

The patches are applied on top of commit a10c9c710f9ecea87b9f4bbb837467893b4bef01 from Linus’s tree.


In addition, I tested it again on a bare metal, then all tests passed except “Collapse with max_ptes_swap pages swapped out”, which failed most of time but succeeded occasionally.

When it fails, the tracing info is (seems like the same PTE not writable issue):

# sudo ./trace.sh
Save THP and khugepaged settings... OK
Adjust settings... OK
Swapout 64 of 512 pages... OK
Collapse with max_ptes_swap pages swapped out.... Fail
Restore THP and khugepaged settings... OK
# tracer: nop
#
# entries-in-buffer/entries-written: 42/42 #P:48
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
khugepaged-265 [002] ...1 1007.308366: mm_collapse_huge_page_swapin: mm=000000007c6cc479, swapped_in=64, referenced=448, ret=1
khugepaged-265 [002] ...2 1007.308403: mm_collapse_huge_page_isolate: scan_pfn=0x1faa6ee, none_or_zero=0, referenced=60, writable=0, status=not_suitable_page_count
khugepaged-265 [002] ...1 1007.308404: mm_collapse_huge_page: mm=000000007c6cc479, isolated=0, status=failed
khugepaged-265 [002] ...1 1007.308405: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1fafa45, writable=1, referenced=448, none_or_zero=0, status=succeeded, unmapped=64
khugepaged-265 [002] ...1 1007.328249: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.344263: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.360263: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.376241: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.392237: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.408284: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.424302: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.440284: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.456285: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.472280: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.488282: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.504264: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.520293: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.536282: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.552278: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.568302: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.584290: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.600300: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.616292: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.632290: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.648304: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.664305: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.680289: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.696294: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.712294: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.728309: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.744291: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.760288: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.776312: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.792288: mm_khugepaged_scan_pmd: mm=000000007c6cc479, scan_pfn=0x1faa6ee, writable=0, referenced=60, none_or_zero=0, status=not_suitable_page_count, unmapped=0
khugepaged-265 [002] ...1 1007.811482: mm_khugepaged_scan_pmd: mm=000000006aaca29d, scan_pfn=0xffffffffffffffff, writable=0, referenced=0, none_or_zero=0, status=pmd_null, unmapped=0
khugepaged-265 [002] ...1 1007.811482: mm_khugepaged_scan_pmd: mm=000000006aaca29d, scan_pfn=0xffffffffffffffff, writable=0, referenced=0, none_or_zero=0, status=pmd_null, unmapped=0
khugepaged-265 [002] ...1 1007.811483: mm_khugepaged_scan_pmd: mm=000000006aaca29d, scan_pfn=0xffffffffffffffff, writable=0, referenced=0, none_or_zero=0, status=pmd_null, unmapped=0
khugepaged-265 [002] ...1 1007.811483: mm_khugepaged_scan_pmd: mm=000000006aaca29d, scan_pfn=0xffffffffffffffff, writable=0, referenced=0, none_or_zero=0, status=pmd_null, unmapped=0
khugepaged-265 [002] ...1 1007.811483: mm_khugepaged_scan_pmd: mm=000000006aaca29d, scan_pfn=0xffffffffffffffff, writable=0, referenced=0, none_or_zero=0, status=pmd_null, unmapped=0
khugepaged-265 [002] ...1 1007.811484: mm_khugepaged_scan_pmd: mm=000000006aaca29d, scan_pfn=0xffffffffffffffff, writable=0, referenced=0, none_or_zero=0, status=pmd_null, unmapped=0
khugepaged-265 [002] ...1 1007.811502: mm_khugepaged_scan_pmd: mm=000000005268d9a7, scan_pfn=0xffffffffffffffff, writable=0, referenced=0, none_or_zero=0, status=pmd_null, unmapped=0
khugepaged-265 [002] ...1 1007.811502: mm_khugepaged_scan_pmd: mm=000000005268d9a7, scan_pfn=0xffffffffffffffff, writable=0, referenced=0, none_or_zero=0, status=pmd_null, unmapped=0



Best Regards,
Yan Zi


Attachments:
signature.asc (871.00 B)
OpenPGP digital signature

2020-04-08 20:28:35

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCHv2 6/8] khugepaged: Allow to collapse PTE-mapped compound pages



On 4/8/20 6:29 AM, Kirill A. Shutemov wrote:
> On Mon, Apr 06, 2020 at 02:29:35PM -0700, Yang Shi wrote:
>>
>> On 4/3/20 4:29 AM, Kirill A. Shutemov wrote:
>>> We can collapse PTE-mapped compound pages. We only need to avoid
>>> handling them more than once: lock/unlock page only once if it's present
>>> in the PMD range multiple times as it handled on compound level. The
>>> same goes for LRU isolation and putback.
>>>
>>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>>> ---
>>> mm/khugepaged.c | 103 ++++++++++++++++++++++++++++++++----------------
>>> 1 file changed, 68 insertions(+), 35 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 1e7e6543ebca..49e56e4e30d1 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -515,23 +515,37 @@ void __khugepaged_exit(struct mm_struct *mm)
>>> static void release_pte_page(struct page *page)
>>> {
>>> - dec_node_page_state(page, NR_ISOLATED_ANON + page_is_file_cache(page));
>>> + mod_node_page_state(page_pgdat(page),
>>> + NR_ISOLATED_ANON + page_is_file_cache(page),
>>> + -compound_nr(page));
>>> unlock_page(page);
>>> putback_lru_page(page);
>>> }
>>> -static void release_pte_pages(pte_t *pte, pte_t *_pte)
>>> +static void release_pte_pages(pte_t *pte, pte_t *_pte,
>>> + struct list_head *compound_pagelist)
>>> {
>>> + struct page *page, *tmp;
>>> +
>>> while (--_pte >= pte) {
>>> pte_t pteval = *_pte;
>>> - if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval)))
>>> - release_pte_page(pte_page(pteval));
>>> +
>>> + page = pte_page(pteval);
>>> + if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval)) &&
>>> + !PageCompound(page))
>>> + release_pte_page(page);
>>> + }
>>> +
>>> + list_for_each_entry_safe(page, tmp, compound_pagelist, lru) {
>>> + list_del(&page->lru);
>>> + release_pte_page(page);
>>> }
>>> }
>>> static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>> unsigned long address,
>>> - pte_t *pte)
>>> + pte_t *pte,
>>> + struct list_head *compound_pagelist)
>>> {
>>> struct page *page = NULL;
>>> pte_t *_pte;
>>> @@ -561,13 +575,21 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>> goto out;
>>> }
>>> - /* TODO: teach khugepaged to collapse THP mapped with pte */
>>> + VM_BUG_ON_PAGE(!PageAnon(page), page);
>>> +
>>> if (PageCompound(page)) {
>>> - result = SCAN_PAGE_COMPOUND;
>>> - goto out;
>>> - }
>>> + struct page *p;
>>> + page = compound_head(page);
>>> - VM_BUG_ON_PAGE(!PageAnon(page), page);
>>> + /*
>>> + * Check if we have dealt with the compound page
>>> + * already
>>> + */
>>> + list_for_each_entry(p, compound_pagelist, lru) {
>>> + if (page == p)
>>> + goto next;
>>> + }
>>> + }
>>> /*
>>> * We can do it before isolate_lru_page because the
>>> @@ -597,19 +619,15 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>> result = SCAN_PAGE_COUNT;
>>> goto out;
>>> }
>>> - if (pte_write(pteval)) {
>>> - writable = true;
>>> - } else {
>>> - if (PageSwapCache(page) &&
>>> - !reuse_swap_page(page, NULL)) {
>>> - unlock_page(page);
>>> - result = SCAN_SWAP_CACHE_PAGE;
>>> - goto out;
>>> - }
>>> + if (!pte_write(pteval) && PageSwapCache(page) &&
>>> + !reuse_swap_page(page, NULL)) {
>>> /*
>>> - * Page is not in the swap cache. It can be collapsed
>>> - * into a THP.
>>> + * Page is in the swap cache and cannot be re-used.
>>> + * It cannot be collapsed into a THP.
>>> */
>>> + unlock_page(page);
>>> + result = SCAN_SWAP_CACHE_PAGE;
>>> + goto out;
>>> }
>>> /*
>>> @@ -621,16 +639,23 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>> result = SCAN_DEL_PAGE_LRU;
>>> goto out;
>>> }
>>> - inc_node_page_state(page,
>>> - NR_ISOLATED_ANON + page_is_file_cache(page));
>>> + mod_node_page_state(page_pgdat(page),
>>> + NR_ISOLATED_ANON + page_is_file_cache(page),
>>> + compound_nr(page));
>>> VM_BUG_ON_PAGE(!PageLocked(page), page);
>>> VM_BUG_ON_PAGE(PageLRU(page), page);
>>> + if (PageCompound(page))
>>> + list_add_tail(&page->lru, compound_pagelist);
>>> +next:
>>> /* There should be enough young pte to collapse the page */
>>> if (pte_young(pteval) ||
>>> page_is_young(page) || PageReferenced(page) ||
>>> mmu_notifier_test_young(vma->vm_mm, address))
>>> referenced++;
>>> +
>>> + if (pte_write(pteval))
>>> + writable = true;
>>> }
>>> if (likely(writable)) {
>>> if (likely(referenced)) {
>>> @@ -644,7 +669,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>> }
>>> out:
>>> - release_pte_pages(pte, _pte);
>>> + release_pte_pages(pte, _pte, compound_pagelist);
>>> trace_mm_collapse_huge_page_isolate(page, none_or_zero,
>>> referenced, writable, result);
>>> return 0;
>>> @@ -653,13 +678,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>> static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>>> struct vm_area_struct *vma,
>>> unsigned long address,
>>> - spinlock_t *ptl)
>>> + spinlock_t *ptl,
>>> + struct list_head *compound_pagelist)
>>> {
>>> + struct page *src_page, *tmp;
>>> pte_t *_pte;
>>> for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>>> _pte++, page++, address += PAGE_SIZE) {
>>> pte_t pteval = *_pte;
>>> - struct page *src_page;
>>> if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>> clear_user_highpage(page, address);
>>> @@ -679,7 +705,6 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>>> } else {
>>> src_page = pte_page(pteval);
>>> copy_user_highpage(page, src_page, address, vma);
>>> - release_pte_page(src_page);
>>> /*
>>> * ptl mostly unnecessary, but preempt has to
>>> * be disabled to update the per-cpu stats
>>> @@ -693,9 +718,18 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>>> pte_clear(vma->vm_mm, address, _pte);
>>> page_remove_rmap(src_page, false);
>>> spin_unlock(ptl);
>>> - free_page_and_swap_cache(src_page);
>>> + if (!PageCompound(src_page)) {
>>> + release_pte_page(src_page);
>>> + free_page_and_swap_cache(src_page);
>>> + }
>>> }
>>> }
>>> +
>>> + list_for_each_entry_safe(src_page, tmp, compound_pagelist, lru) {
>>> + list_del(&src_page->lru);
>>> + release_pte_page(src_page);
>>> + free_page_and_swap_cache(src_page);
>>> + }
>> It looks this may mess up the PTE-mapped THP's refcount if it has multiple
>> PTE-mapped subpages since put_page() is not called for every PTE-mapped
>> subpages.
> Good catch!
>
> I *think* something like this should fix the issue (untested):

Is this an incremental fix?

>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index bfb6155f1d69..9a96f9bff798 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -570,6 +570,7 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
> list_for_each_entry_safe(page, tmp, compound_pagelist, lru) {
> list_del(&page->lru);
> release_pte_page(page);
> + put_page(page);
> }
> }
>
> @@ -682,8 +683,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> VM_BUG_ON_PAGE(!PageLocked(page), page);
> VM_BUG_ON_PAGE(PageLRU(page), page);
>
> - if (PageCompound(page))
> + if (PageCompound(page)) {
> list_add_tail(&page->lru, compound_pagelist);
> + get_page(page);

I don't get why we have to bump refcount here? The THP won't go away
since it is pinned by isolation. I'm supposed calling
free_page_and_swap_cache(src_page) in each loop looks good enough to dec
refcount for each PTE-mapped subpage.

Then calling release_pte_page() for each entry on compund_pagelist
outside the loop to release the pin from isolation. The compound page
would get freed via pagevec finally.

> + }
> next:
> /* There should be enough young pte to collapse the page */
> if (pte_young(pteval) ||
> @@ -742,6 +745,8 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> } else {
> src_page = pte_page(pteval);
> copy_user_highpage(page, src_page, address, vma);
> + if (!PageCompound(src_page))
> + release_pte_page(src_page);
> /*
> * ptl mostly unnecessary, but preempt has to
> * be disabled to update the per-cpu stats
> @@ -755,10 +760,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> pte_clear(vma->vm_mm, address, _pte);
> page_remove_rmap(src_page, false);
> spin_unlock(ptl);
> - if (!PageCompound(src_page)) {
> - release_pte_page(src_page);
> - free_page_and_swap_cache(src_page);
> - }
> + free_page_and_swap_cache(src_page);
> }
> }
>

2020-04-09 13:50:05

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2 6/8] khugepaged: Allow to collapse PTE-mapped compound pages

On Wed, Apr 08, 2020 at 11:57:27AM -0700, Yang Shi wrote:
>
>
> On 4/8/20 6:29 AM, Kirill A. Shutemov wrote:
> > On Mon, Apr 06, 2020 at 02:29:35PM -0700, Yang Shi wrote:
> > >
> > > On 4/3/20 4:29 AM, Kirill A. Shutemov wrote:
> > > > We can collapse PTE-mapped compound pages. We only need to avoid
> > > > handling them more than once: lock/unlock page only once if it's present
> > > > in the PMD range multiple times as it handled on compound level. The
> > > > same goes for LRU isolation and putback.
> > > >
> > > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > > > ---
> > > > mm/khugepaged.c | 103 ++++++++++++++++++++++++++++++++----------------
> > > > 1 file changed, 68 insertions(+), 35 deletions(-)
> > > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index 1e7e6543ebca..49e56e4e30d1 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -515,23 +515,37 @@ void __khugepaged_exit(struct mm_struct *mm)
> > > > static void release_pte_page(struct page *page)
> > > > {
> > > > - dec_node_page_state(page, NR_ISOLATED_ANON + page_is_file_cache(page));
> > > > + mod_node_page_state(page_pgdat(page),
> > > > + NR_ISOLATED_ANON + page_is_file_cache(page),
> > > > + -compound_nr(page));
> > > > unlock_page(page);
> > > > putback_lru_page(page);
> > > > }
> > > > -static void release_pte_pages(pte_t *pte, pte_t *_pte)
> > > > +static void release_pte_pages(pte_t *pte, pte_t *_pte,
> > > > + struct list_head *compound_pagelist)
> > > > {
> > > > + struct page *page, *tmp;
> > > > +
> > > > while (--_pte >= pte) {
> > > > pte_t pteval = *_pte;
> > > > - if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval)))
> > > > - release_pte_page(pte_page(pteval));
> > > > +
> > > > + page = pte_page(pteval);
> > > > + if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval)) &&
> > > > + !PageCompound(page))
> > > > + release_pte_page(page);
> > > > + }
> > > > +
> > > > + list_for_each_entry_safe(page, tmp, compound_pagelist, lru) {
> > > > + list_del(&page->lru);
> > > > + release_pte_page(page);
> > > > }
> > > > }
> > > > static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > > > unsigned long address,
> > > > - pte_t *pte)
> > > > + pte_t *pte,
> > > > + struct list_head *compound_pagelist)
> > > > {
> > > > struct page *page = NULL;
> > > > pte_t *_pte;
> > > > @@ -561,13 +575,21 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > > > goto out;
> > > > }
> > > > - /* TODO: teach khugepaged to collapse THP mapped with pte */
> > > > + VM_BUG_ON_PAGE(!PageAnon(page), page);
> > > > +
> > > > if (PageCompound(page)) {
> > > > - result = SCAN_PAGE_COMPOUND;
> > > > - goto out;
> > > > - }
> > > > + struct page *p;
> > > > + page = compound_head(page);
> > > > - VM_BUG_ON_PAGE(!PageAnon(page), page);
> > > > + /*
> > > > + * Check if we have dealt with the compound page
> > > > + * already
> > > > + */
> > > > + list_for_each_entry(p, compound_pagelist, lru) {
> > > > + if (page == p)
> > > > + goto next;
> > > > + }
> > > > + }
> > > > /*
> > > > * We can do it before isolate_lru_page because the
> > > > @@ -597,19 +619,15 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > > > result = SCAN_PAGE_COUNT;
> > > > goto out;
> > > > }
> > > > - if (pte_write(pteval)) {
> > > > - writable = true;
> > > > - } else {
> > > > - if (PageSwapCache(page) &&
> > > > - !reuse_swap_page(page, NULL)) {
> > > > - unlock_page(page);
> > > > - result = SCAN_SWAP_CACHE_PAGE;
> > > > - goto out;
> > > > - }
> > > > + if (!pte_write(pteval) && PageSwapCache(page) &&
> > > > + !reuse_swap_page(page, NULL)) {
> > > > /*
> > > > - * Page is not in the swap cache. It can be collapsed
> > > > - * into a THP.
> > > > + * Page is in the swap cache and cannot be re-used.
> > > > + * It cannot be collapsed into a THP.
> > > > */
> > > > + unlock_page(page);
> > > > + result = SCAN_SWAP_CACHE_PAGE;
> > > > + goto out;
> > > > }
> > > > /*
> > > > @@ -621,16 +639,23 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > > > result = SCAN_DEL_PAGE_LRU;
> > > > goto out;
> > > > }
> > > > - inc_node_page_state(page,
> > > > - NR_ISOLATED_ANON + page_is_file_cache(page));
> > > > + mod_node_page_state(page_pgdat(page),
> > > > + NR_ISOLATED_ANON + page_is_file_cache(page),
> > > > + compound_nr(page));
> > > > VM_BUG_ON_PAGE(!PageLocked(page), page);
> > > > VM_BUG_ON_PAGE(PageLRU(page), page);
> > > > + if (PageCompound(page))
> > > > + list_add_tail(&page->lru, compound_pagelist);
> > > > +next:
> > > > /* There should be enough young pte to collapse the page */
> > > > if (pte_young(pteval) ||
> > > > page_is_young(page) || PageReferenced(page) ||
> > > > mmu_notifier_test_young(vma->vm_mm, address))
> > > > referenced++;
> > > > +
> > > > + if (pte_write(pteval))
> > > > + writable = true;
> > > > }
> > > > if (likely(writable)) {
> > > > if (likely(referenced)) {
> > > > @@ -644,7 +669,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > > > }
> > > > out:
> > > > - release_pte_pages(pte, _pte);
> > > > + release_pte_pages(pte, _pte, compound_pagelist);
> > > > trace_mm_collapse_huge_page_isolate(page, none_or_zero,
> > > > referenced, writable, result);
> > > > return 0;
> > > > @@ -653,13 +678,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > > > static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> > > > struct vm_area_struct *vma,
> > > > unsigned long address,
> > > > - spinlock_t *ptl)
> > > > + spinlock_t *ptl,
> > > > + struct list_head *compound_pagelist)
> > > > {
> > > > + struct page *src_page, *tmp;
> > > > pte_t *_pte;
> > > > for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> > > > _pte++, page++, address += PAGE_SIZE) {
> > > > pte_t pteval = *_pte;
> > > > - struct page *src_page;
> > > > if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > > > clear_user_highpage(page, address);
> > > > @@ -679,7 +705,6 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> > > > } else {
> > > > src_page = pte_page(pteval);
> > > > copy_user_highpage(page, src_page, address, vma);
> > > > - release_pte_page(src_page);
> > > > /*
> > > > * ptl mostly unnecessary, but preempt has to
> > > > * be disabled to update the per-cpu stats
> > > > @@ -693,9 +718,18 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> > > > pte_clear(vma->vm_mm, address, _pte);
> > > > page_remove_rmap(src_page, false);
> > > > spin_unlock(ptl);
> > > > - free_page_and_swap_cache(src_page);
> > > > + if (!PageCompound(src_page)) {
> > > > + release_pte_page(src_page);
> > > > + free_page_and_swap_cache(src_page);
> > > > + }
> > > > }
> > > > }
> > > > +
> > > > + list_for_each_entry_safe(src_page, tmp, compound_pagelist, lru) {
> > > > + list_del(&src_page->lru);
> > > > + release_pte_page(src_page);
> > > > + free_page_and_swap_cache(src_page);
> > > > + }
> > > It looks this may mess up the PTE-mapped THP's refcount if it has multiple
> > > PTE-mapped subpages since put_page() is not called for every PTE-mapped
> > > subpages.
> > Good catch!
> >
> > I *think* something like this should fix the issue (untested):
>
> Is this an incremental fix?

Yes. I will fold it in before the next submission.

> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index bfb6155f1d69..9a96f9bff798 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -570,6 +570,7 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
> > list_for_each_entry_safe(page, tmp, compound_pagelist, lru) {
> > list_del(&page->lru);
> > release_pte_page(page);
> > + put_page(page);
> > }
> > }
> > @@ -682,8 +683,10 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > VM_BUG_ON_PAGE(!PageLocked(page), page);
> > VM_BUG_ON_PAGE(PageLRU(page), page);
> > - if (PageCompound(page))
> > + if (PageCompound(page)) {
> > list_add_tail(&page->lru, compound_pagelist);
> > + get_page(page);
>
> I don't get why we have to bump refcount here? The THP won't go away since
> it is pinned by isolation. I'm supposed calling
> free_page_and_swap_cache(src_page) in each loop looks good enough to dec
> refcount for each PTE-mapped subpage.
>
> Then calling release_pte_page() for each entry on compund_pagelist outside
> the loop to release the pin from isolation. The compound page would get
> freed via pagevec finally.

Fair enough.

--
Kirill A. Shutemov

2020-04-09 16:36:30

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2 2/8] khugepaged: Do not stop collapse if less than half PTEs are referenced

On Mon, Apr 06, 2020 at 12:53:00PM -0700, Ralph Campbell wrote:
>
> On 4/3/20 4:29 AM, Kirill A. Shutemov wrote:
> > __collapse_huge_page_swapin() check number of referenced PTE to decide
> > if the memory range is hot enough to justify swapin.
> >
> > The problem is that it stops collapse altogether if there's not enough
> > referenced pages, not only swappingin.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > Fixes: 0db501f7a34c ("mm, thp: convert from optimistic swapin collapsing to conservative")
> > Reviewed-by: Zi Yan <[email protected]> ---
> > mm/khugepaged.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 99bab7e4d05b..14d7afc90786 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -905,7 +905,8 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
> > /* we only decide to swapin, if there is enough young ptes */
> > if (referenced < HPAGE_PMD_NR/2) {
> > trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
>
> The trace point is recording the return value. Shouldn't you s/0/1 to match?

Good catch.

>
> > - return false;
> > + /* Do not block collapse, only skip swapping in */
> > + return true;
> > }
> > vmf.pte = pte_offset_map(pmd, address);
> > for (; vmf.address < address + HPAGE_PMD_NR*PAGE_SIZE;
> >
>
> So "referenced < HPAGE_PMD_NR/2" means swapped out pages aren't faulted back in
> but there could still be no swapped out pages, just "old" pages so collapse could
> succeed. Seems like this check could have been made in khugepaged_scan_pmd() when
> "referenced" is being computed and "unmapped" is available. Just skip setting
> "ret = 1" if unmapped > 0 && referenced < HPAGE_PMD_NR/2.

Good point. I'll change the code to address it.

--
Kirill A. Shutemov

2020-04-10 11:50:07

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2 1/8] khugepaged: Add self test

On Mon, Apr 06, 2020 at 10:59:52AM -0400, Zi Yan wrote:
> I ran this test with all patches from this series applied to Linus’s tree, but still see several failures. Is it expected?
> The config file is attached. Let me know if I miss anything. BTW, I am running in a VM.
>
> Thanks.
>
> The output:
>
> ➜ ~ sudo ./khugepaged
> Save THP and khugepaged settings... OK
> Adjust settings... OK
> Allocate huge page on fault... OK
> Split huge PMD on MADV_DONTNEED... OK
> Collapse fully populated PTE table.... Fail

I was able to reproduce the issue. And it's fun failure mode.

How did you get the test case inside the VM? Copy-paste source using 'cat'
or something similar inside the VM?

It screwed up CHECK_HUGE_FMT and CHECK_SWAP_FMT for me. Double back slash
was converted to single. As result check_huge() and check_swap() gave the
false-negative result all the time.

Could you check that the source of the test-case is not mangled and
re-test if it is.

--
Kirill A. Shutemov

2020-04-10 14:39:39

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCHv2 1/8] khugepaged: Add self test

On 10 Apr 2020, at 7:47, Kirill A. Shutemov wrote:

> External email: Use caution opening links or attachments
>
>
> On Mon, Apr 06, 2020 at 10:59:52AM -0400, Zi Yan wrote:
>> I ran this test with all patches from this series applied to Linus’s tree, but still see several failures. Is it expected?
>> The config file is attached. Let me know if I miss anything. BTW, I am running in a VM.
>>
>> Thanks.
>>
>> The output:
>>
>> ➜ ~ sudo ./khugepaged
>> Save THP and khugepaged settings... OK
>> Adjust settings... OK
>> Allocate huge page on fault... OK
>> Split huge PMD on MADV_DONTNEED... OK
>> Collapse fully populated PTE table.... Fail
>
> I was able to reproduce the issue. And it's fun failure mode.
>
> How did you get the test case inside the VM? Copy-paste source using 'cat'
> or something similar inside the VM?

First of all, the failure above was from a bare metal and was the only failure I saw, whereas I saw more failures in my VM. The test program was not messed up in either environment.

For VM failures I mentioned before, I used scp to copy the source code into the VM. My VM has its port 22 forwarded to host’s port 11022. “-net user,hostfwd=tcp::11022-:22”. I also copied a binary into my VM and saw the same failures.

I kinda think the failures are not related to your patches but something else.

>
> It screwed up CHECK_HUGE_FMT and CHECK_SWAP_FMT for me. Double back slash
> was converted to single. As result check_huge() and check_swap() gave the
> false-negative result all the time.

It was not my case, since CHECK_HUGE_FMT and CHECK_SWAP_FMT from my khugepaged.c match your patch code.

>
> Could you check that the source of the test-case is not mangled and
> re-test if it is.

I can confirm that the test-case is not mangled. I think it must be my VM setup or kernel configuration.

Do you mind sharing your .config file with me and which kernel commit you apply the patches on top of in your setup? I can look into it and check what the problem is.

Thanks.


--
Best Regards,
Yan Zi


Attachments:
signature.asc (871.00 B)
OpenPGP digital signature

2020-04-10 14:59:26

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2 1/8] khugepaged: Add self test

On Fri, Apr 10, 2020 at 10:36:58AM -0400, Zi Yan wrote:
> On 10 Apr 2020, at 7:47, Kirill A. Shutemov wrote:
>
> > External email: Use caution opening links or attachments
> >
> >
> > On Mon, Apr 06, 2020 at 10:59:52AM -0400, Zi Yan wrote:
> >> I ran this test with all patches from this series applied to Linus’s tree, but still see several failures. Is it expected?
> >> The config file is attached. Let me know if I miss anything. BTW, I am running in a VM.
> >>
> >> Thanks.
> >>
> >> The output:
> >>
> >> ➜ ~ sudo ./khugepaged
> >> Save THP and khugepaged settings... OK
> >> Adjust settings... OK
> >> Allocate huge page on fault... OK
> >> Split huge PMD on MADV_DONTNEED... OK
> >> Collapse fully populated PTE table.... Fail
> >
> > I was able to reproduce the issue. And it's fun failure mode.
> >
> > How did you get the test case inside the VM? Copy-paste source using 'cat'
> > or something similar inside the VM?
>
> First of all, the failure above was from a bare metal and was the only
> failure I saw, whereas I saw more failures in my VM. The test program
> was not messed up in either environment.

Hm. In the quote you are saying "BTW, I am running in a VM".

>
> For VM failures I mentioned before, I used scp to copy the source code
> into the VM. My VM has its port 22 forwarded to host’s port 11022. “-net
> user,hostfwd=tcp::11022-:22”. I also copied a binary into my VM and saw
> the same failures.
>
> I kinda think the failures are not related to your patches but something else.
>
> >
> > It screwed up CHECK_HUGE_FMT and CHECK_SWAP_FMT for me. Double back slash
> > was converted to single. As result check_huge() and check_swap() gave the
> > false-negative result all the time.
>
> It was not my case, since CHECK_HUGE_FMT and CHECK_SWAP_FMT from my
> khugepaged.c match your patch code.
>
> >
> > Could you check that the source of the test-case is not mangled and
> > re-test if it is.
>
> I can confirm that the test-case is not mangled. I think it must be my
> VM setup or kernel configuration.
>
> Do you mind sharing your .config file with me and which kernel commit
> you apply the patches on top of in your setup? I can look into it and
> check what the problem is.

My config is attached.


--
Kirill A. Shutemov


Attachments:
(No filename) (2.29 kB)
.config (132.92 kB)
Download all attachments

2020-04-10 15:06:27

by Zi Yan

[permalink] [raw]
Subject: Re: [PATCHv2 1/8] khugepaged: Add self test

On 10 Apr 2020, at 10:58, Kirill A. Shutemov wrote:

> External email: Use caution opening links or attachments
>
>
> On Fri, Apr 10, 2020 at 10:36:58AM -0400, Zi Yan wrote:
>> On 10 Apr 2020, at 7:47, Kirill A. Shutemov wrote:
>>
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On Mon, Apr 06, 2020 at 10:59:52AM -0400, Zi Yan wrote:
>>>> I ran this test with all patches from this series applied to Linus’s tree, but still see several failures. Is it expected?
>>>> The config file is attached. Let me know if I miss anything. BTW, I am running in a VM.
>>>>
>>>> Thanks.
>>>>
>>>> The output:
>>>>
>>>> ➜ ~ sudo ./khugepaged
>>>> Save THP and khugepaged settings... OK
>>>> Adjust settings... OK
>>>> Allocate huge page on fault... OK
>>>> Split huge PMD on MADV_DONTNEED... OK
>>>> Collapse fully populated PTE table.... Fail
>>>
>>> I was able to reproduce the issue. And it's fun failure mode.
>>>
>>> How did you get the test case inside the VM? Copy-paste source using 'cat'
>>> or something similar inside the VM?
>>
>> First of all, the failure above was from a bare metal and was the only
>> failure I saw, whereas I saw more failures in my VM. The test program
>> was not messed up in either environment.
>
> Hm. In the quote you are saying "BTW, I am running in a VM".

Sorry, misread the email thread. I was referring to another my email on running
tests on a bare metal, where only “Collapse with max_ptes_swap pages swapped out”
failed.

Here is the link to the email:
https://lore.kernel.org/linux-mm/[email protected]/

>
>>
>> For VM failures I mentioned before, I used scp to copy the source code
>> into the VM. My VM has its port 22 forwarded to host’s port 11022. “-net
>> user,hostfwd=tcp::11022-:22”. I also copied a binary into my VM and saw
>> the same failures.
>>
>> I kinda think the failures are not related to your patches but something else.
>>
>>>
>>> It screwed up CHECK_HUGE_FMT and CHECK_SWAP_FMT for me. Double back slash
>>> was converted to single. As result check_huge() and check_swap() gave the
>>> false-negative result all the time.
>>
>> It was not my case, since CHECK_HUGE_FMT and CHECK_SWAP_FMT from my
>> khugepaged.c match your patch code.
>>
>>>
>>> Could you check that the source of the test-case is not mangled and
>>> re-test if it is.
>>
>> I can confirm that the test-case is not mangled. I think it must be my
>> VM setup or kernel configuration.
>>
>> Do you mind sharing your .config file with me and which kernel commit
>> you apply the patches on top of in your setup? I can look into it and
>> check what the problem is.
>
> My config is attached.

Thanks.


Best Regards,
Yan Zi


Attachments:
signature.asc (871.00 B)
OpenPGP digital signature