Currently, anonymous PTE-mapped THPs cannot be collapsed in-place:
collapsing (e.g., via MADV_COLLAPSE) implies allocating a fresh THP and
mapping that new THP via a PMD: as it's a fresh anon THP, it will get the
exclusive flag set on the head page and everybody is happy.
However, if the kernel would ever support in-place collapse of anonymous
THPs (replacing a page table mapping each sub-page of a THP via PTEs with a
single PMD mapping the complete THP), exclusivity information stored for
each sub-page would have to be collapsed accordingly:
(1) All PTEs map !exclusive anon sub-pages: the in-place collapsed THP
must not not have the exclusive flag set on the head page mapped by
the PMD. This is the easiest case to handle ("simply don't set any
exclusive flags").
(2) All PTEs map exclusive anon sub-pages: when collapsing, we have to
clear the exclusive flag from all tail pages and only leave the
exclusive flag set for the head page. Otherwise, fork() after
collapse would not clear the exclusive flags from the tail pages
and we'd be in trouble once PTE-mapping the shared THP when writing
to shared tail pages that still have the exclusive flag set. This
would effectively revert what the PTE-mapping code does when
propagating the exclusive flag to all sub-pages.
(3) PTEs map a mixture of exclusive and !exclusive anon sub-pages (can
happen e.g., due to MADV_DONTFORK before fork()). We must not
collapse the THP in-place, otherwise bad things may happen:
the exclusive flags of sub-pages would get ignored and the
exclusive flag of the head page would get used instead.
Now that we have MADV_COLLAPSE in place to trigger collapsing a THP,
let's add some test cases that would bail out early, if we'd
voluntarily/accidantially unlock in-place collapse for anon THPs and
forget about taking proper care of exclusive flags.
Running the test on a kernel with MADV_COLLAPSE support:
# [INFO] Anonymous THP tests
# [RUN] Basic COW after fork() when collapsing before fork()
ok 169 No leak from parent into child
# [RUN] Basic COW after fork() when collapsing after fork() (fully shared)
ok 170 # SKIP MADV_COLLAPSE failed: Invalid argument
# [RUN] Basic COW after fork() when collapsing after fork() (lower shared)
ok 171 No leak from parent into child
# [RUN] Basic COW after fork() when collapsing after fork() (upper shared)
ok 172 No leak from parent into child
For now, MADV_COLLAPSE always seems to fail if all PTEs map shared
sub-pages.
Cc: Andrew Morton <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Zach O'Keefe <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
A patch from Hugh made me explore the wonderful world of in-place collapse
of THP, and I was briefly concerned that it would apply to anon THP as
well. After thinking about it a bit, I decided to add test cases, to better
be safe than sorry in any case, and to document how PG_anon_exclusive is to
be handled in that case.
---
tools/testing/selftests/vm/cow.c | 228 +++++++++++++++++++++++++++++++
1 file changed, 228 insertions(+)
diff --git a/tools/testing/selftests/vm/cow.c b/tools/testing/selftests/vm/cow.c
index 26f6ea3079e2..16216d893d96 100644
--- a/tools/testing/selftests/vm/cow.c
+++ b/tools/testing/selftests/vm/cow.c
@@ -30,6 +30,10 @@
#include "../kselftest.h"
#include "vm_util.h"
+#ifndef MADV_COLLAPSE
+#define MADV_COLLAPSE 25
+#endif
+
static size_t pagesize;
static int pagemap_fd;
static size_t thpsize;
@@ -1178,6 +1182,228 @@ static int tests_per_anon_test_case(void)
return tests;
}
+enum anon_thp_collapse_test {
+ ANON_THP_COLLAPSE_UNSHARED,
+ ANON_THP_COLLAPSE_FULLY_SHARED,
+ ANON_THP_COLLAPSE_LOWER_SHARED,
+ ANON_THP_COLLAPSE_UPPER_SHARED,
+};
+
+static void do_test_anon_thp_collapse(char *mem, size_t size,
+ enum anon_thp_collapse_test test)
+{
+ struct comm_pipes comm_pipes;
+ char buf;
+ int ret;
+
+ ret = setup_comm_pipes(&comm_pipes);
+ if (ret) {
+ ksft_test_result_fail("pipe() failed\n");
+ return;
+ }
+
+ /*
+ * Trigger PTE-mapping the THP by temporarily mapping a single subpage
+ * R/O, such that we can try collapsing it later.
+ */
+ ret = mprotect(mem + pagesize, pagesize, PROT_READ);
+ if (ret) {
+ ksft_test_result_fail("mprotect() failed\n");
+ goto close_comm_pipes;
+ }
+ ret = mprotect(mem + pagesize, pagesize, PROT_READ | PROT_WRITE);
+ if (ret) {
+ ksft_test_result_fail("mprotect() failed\n");
+ goto close_comm_pipes;
+ }
+
+ switch (test) {
+ case ANON_THP_COLLAPSE_UNSHARED:
+ /* Collapse before actually COW-sharing the page. */
+ ret = madvise(mem, size, MADV_COLLAPSE);
+ if (ret) {
+ ksft_test_result_skip("MADV_COLLAPSE failed: %s\n",
+ strerror(errno));
+ goto close_comm_pipes;
+ }
+ break;
+ case ANON_THP_COLLAPSE_FULLY_SHARED:
+ /* COW-share the full PTE-mapped THP. */
+ break;
+ case ANON_THP_COLLAPSE_LOWER_SHARED:
+ /* Don't COW-share the upper part of the THP. */
+ ret = madvise(mem + size / 2, size / 2, MADV_DONTFORK);
+ if (ret) {
+ ksft_test_result_fail("MADV_DONTFORK failed\n");
+ goto close_comm_pipes;
+ }
+ break;
+ case ANON_THP_COLLAPSE_UPPER_SHARED:
+ /* Don't COW-share the lower part of the THP. */
+ ret = madvise(mem, size / 2, MADV_DONTFORK);
+ if (ret) {
+ ksft_test_result_fail("MADV_DONTFORK failed\n");
+ goto close_comm_pipes;
+ }
+ break;
+ default:
+ assert(false);
+ }
+
+ ret = fork();
+ if (ret < 0) {
+ ksft_test_result_fail("fork() failed\n");
+ goto close_comm_pipes;
+ } else if (!ret) {
+ switch (test) {
+ case ANON_THP_COLLAPSE_UNSHARED:
+ case ANON_THP_COLLAPSE_FULLY_SHARED:
+ exit(child_memcmp_fn(mem, size, &comm_pipes));
+ break;
+ case ANON_THP_COLLAPSE_LOWER_SHARED:
+ exit(child_memcmp_fn(mem, size / 2, &comm_pipes));
+ break;
+ case ANON_THP_COLLAPSE_UPPER_SHARED:
+ exit(child_memcmp_fn(mem + size / 2, size / 2,
+ &comm_pipes));
+ break;
+ default:
+ assert(false);
+ }
+ }
+
+ while (read(comm_pipes.child_ready[0], &buf, 1) != 1)
+ ;
+
+ switch (test) {
+ case ANON_THP_COLLAPSE_UNSHARED:
+ break;
+ case ANON_THP_COLLAPSE_UPPER_SHARED:
+ case ANON_THP_COLLAPSE_LOWER_SHARED:
+ /*
+ * Revert MADV_DONTFORK such that we merge the VMAs and are
+ * able to actually collapse.
+ */
+ ret = madvise(mem, size, MADV_DOFORK);
+ if (ret) {
+ ksft_test_result_fail("MADV_DOFORK failed\n");
+ write(comm_pipes.parent_ready[1], "0", 1);
+ wait(&ret);
+ goto close_comm_pipes;
+ }
+ /* FALLTHROUGH */
+ case ANON_THP_COLLAPSE_FULLY_SHARED:
+ /* Collapse before anyone modified the COW-shared page. */
+ ret = madvise(mem, size, MADV_COLLAPSE);
+ if (ret) {
+ ksft_test_result_skip("MADV_COLLAPSE failed: %s\n",
+ strerror(errno));
+ write(comm_pipes.parent_ready[1], "0", 1);
+ wait(&ret);
+ goto close_comm_pipes;
+ }
+ break;
+ default:
+ assert(false);
+ }
+
+ /* Modify the page. */
+ memset(mem, 0xff, size);
+ write(comm_pipes.parent_ready[1], "0", 1);
+
+ wait(&ret);
+ if (WIFEXITED(ret))
+ ret = WEXITSTATUS(ret);
+ else
+ ret = -EINVAL;
+
+ ksft_test_result(!ret, "No leak from parent into child\n");
+close_comm_pipes:
+ close_comm_pipes(&comm_pipes);
+}
+
+static void test_anon_thp_collapse_unshared(char *mem, size_t size)
+{
+ do_test_anon_thp_collapse(mem, size, ANON_THP_COLLAPSE_UNSHARED);
+}
+
+static void test_anon_thp_collapse_fully_shared(char *mem, size_t size)
+{
+ do_test_anon_thp_collapse(mem, size, ANON_THP_COLLAPSE_FULLY_SHARED);
+}
+
+static void test_anon_thp_collapse_lower_shared(char *mem, size_t size)
+{
+ do_test_anon_thp_collapse(mem, size, ANON_THP_COLLAPSE_LOWER_SHARED);
+}
+
+static void test_anon_thp_collapse_upper_shared(char *mem, size_t size)
+{
+ do_test_anon_thp_collapse(mem, size, ANON_THP_COLLAPSE_UPPER_SHARED);
+}
+
+/*
+ * Test cases that are specific to anonymous THP: pages in private mappings
+ * that may get shared via COW during fork().
+ */
+static const struct test_case anon_thp_test_cases[] = {
+ /*
+ * Basic COW test for fork() without any GUP when collapsing a THP
+ * before fork().
+ *
+ * Re-mapping a PTE-mapped anon THP using a single PMD ("in-place
+ * collapse") might easily get COW handling wrong when not collapsing
+ * exclusivity information properly.
+ */
+ {
+ "Basic COW after fork() when collapsing before fork()",
+ test_anon_thp_collapse_unshared,
+ },
+ /* Basic COW test, but collapse after COW-sharing a full THP. */
+ {
+ "Basic COW after fork() when collapsing after fork() (fully shared)",
+ test_anon_thp_collapse_fully_shared,
+ },
+ /*
+ * Basic COW test, but collapse after COW-sharing the lower half of a
+ * THP.
+ */
+ {
+ "Basic COW after fork() when collapsing after fork() (lower shared)",
+ test_anon_thp_collapse_lower_shared,
+ },
+ /*
+ * Basic COW test, but collapse after COW-sharing the upper half of a
+ * THP.
+ */
+ {
+ "Basic COW after fork() when collapsing after fork() (upper shared)",
+ test_anon_thp_collapse_upper_shared,
+ },
+};
+
+static void run_anon_thp_test_cases(void)
+{
+ int i;
+
+ if (!thpsize)
+ return;
+
+ ksft_print_msg("[INFO] Anonymous THP tests\n");
+
+ for (i = 0; i < ARRAY_SIZE(anon_thp_test_cases); i++) {
+ struct test_case const *test_case = &anon_thp_test_cases[i];
+
+ ksft_print_msg("[RUN] %s\n", test_case->desc);
+ do_run_with_thp(test_case->fn, THP_RUN_PMD);
+ }
+}
+
+static int tests_per_anon_thp_test_case(void)
+{
+ return thpsize ? 1 : 0;
+}
+
typedef void (*non_anon_test_fn)(char *mem, const char *smem, size_t size);
static void test_cow(char *mem, const char *smem, size_t size)
@@ -1518,6 +1744,7 @@ int main(int argc, char **argv)
ksft_print_header();
ksft_set_plan(ARRAY_SIZE(anon_test_cases) * tests_per_anon_test_case() +
+ ARRAY_SIZE(anon_thp_test_cases) * tests_per_anon_thp_test_case() +
ARRAY_SIZE(non_anon_test_cases) * tests_per_non_anon_test_case());
gup_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
@@ -1526,6 +1753,7 @@ int main(int argc, char **argv)
ksft_exit_fail_msg("opening pagemap failed\n");
run_anon_test_cases();
+ run_anon_thp_test_cases();
run_non_anon_test_cases();
err = ksft_get_fail_cnt();
--
2.39.0
>
> Currently, anonymous PTE-mapped THPs cannot be collapsed in-place:
> collapsing (e.g., via MADV_COLLAPSE) implies allocating a fresh THP and
> mapping that new THP via a PMD: as it's a fresh anon THP, it will get the
> exclusive flag set on the head page and everybody is happy.
>
> However, if the kernel would ever support in-place collapse of anonymous
> THPs (replacing a page table mapping each sub-page of a THP via PTEs with a
> single PMD mapping the complete THP), exclusivity information stored for
> each sub-page would have to be collapsed accordingly:
>
> (1) All PTEs map !exclusive anon sub-pages: the in-place collapsed THP
> must not not have the exclusive flag set on the head page mapped by
> the PMD. This is the easiest case to handle ("simply don't set any
> exclusive flags").
>
> (2) All PTEs map exclusive anon sub-pages: when collapsing, we have to
> clear the exclusive flag from all tail pages and only leave the
> exclusive flag set for the head page. Otherwise, fork() after
> collapse would not clear the exclusive flags from the tail pages
> and we'd be in trouble once PTE-mapping the shared THP when writing
> to shared tail pages that still have the exclusive flag set. This
> would effectively revert what the PTE-mapping code does when
> propagating the exclusive flag to all sub-pages.
>
> (3) PTEs map a mixture of exclusive and !exclusive anon sub-pages (can
> happen e.g., due to MADV_DONTFORK before fork()). We must not
> collapse the THP in-place, otherwise bad things may happen:
> the exclusive flags of sub-pages would get ignored and the
> exclusive flag of the head page would get used instead.
>
> Now that we have MADV_COLLAPSE in place to trigger collapsing a THP,
> let's add some test cases that would bail out early, if we'd
> voluntarily/accidantially unlock in-place collapse for anon THPs and
> forget about taking proper care of exclusive flags.
Hey David,
Sorry for the (very) delayed review. As our helpful syncs offline, I'm in a
better place to understand the intent of these tests.
On Wed, Jan 4, 2023 at 6:49 AM David Hildenbrand <[email protected]> wrote:
> Running the test on a kernel with MADV_COLLAPSE support:
> # [INFO] Anonymous THP tests
> # [RUN] Basic COW after fork() when collapsing before fork()
> ok 169 No leak from parent into child
> # [RUN] Basic COW after fork() when collapsing after fork() (fully shared)
> ok 170 # SKIP MADV_COLLAPSE failed: Invalid argument
> # [RUN] Basic COW after fork() when collapsing after fork() (lower shared)
> ok 171 No leak from parent into child
> # [RUN] Basic COW after fork() when collapsing after fork() (upper shared)
> ok 172 No leak from parent into child
>
> For now, MADV_COLLAPSE always seems to fail if all PTEs map shared
> sub-pages.
Thanks for pointing this out. I have had a TODO / pending patch to verify this
for awhile. It seems this is due to some old requirement of requiring a single
writeable pte. I don't know this history well here, but I don't think it's
required anymore, and, as this test shows, prevents collapse of
pte-mapped-hugepage shared across fork().
>
> Cc: Andrew Morton <[email protected]>
> Cc: Shuah Khan <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Peter Xu <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Nadav Amit <[email protected]>
> Cc: Zach O'Keefe <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
>
> A patch from Hugh made me explore the wonderful world of in-place collapse
> of THP, and I was briefly concerned that it would apply to anon THP as
> well. After thinking about it a bit, I decided to add test cases, to better
> be safe than sorry in any case, and to document how PG_anon_exclusive is to
> be handled in that case.
>
> ---
> tools/testing/selftests/vm/cow.c | 228 +++++++++++++++++++++++++++++++
> 1 file changed, 228 insertions(+)
>
> diff --git a/tools/testing/selftests/vm/cow.c b/tools/testing/selftests/vm/cow.c
> index 26f6ea3079e2..16216d893d96 100644
> --- a/tools/testing/selftests/vm/cow.c
> +++ b/tools/testing/selftests/vm/cow.c
> @@ -30,6 +30,10 @@
> #include "../kselftest.h"
> #include "vm_util.h"
>
> +#ifndef MADV_COLLAPSE
> +#define MADV_COLLAPSE 25
> +#endif
> +
> static size_t pagesize;
> static int pagemap_fd;
> static size_t thpsize;
> @@ -1178,6 +1182,228 @@ static int tests_per_anon_test_case(void)
> return tests;
> }
>
> +enum anon_thp_collapse_test {
> + ANON_THP_COLLAPSE_UNSHARED,
OK, so this test checks case 2: we see all PG_anon_exclusive, and need to make
sure we clear the bit on tails. Had we not, after fork(), the bit would still be
set on tails (since copy_huge_pmd() -> page_try_dup_anon_rmap() only clears it
on head), and so write to said tails would go through after wp fault, and since
collapse was in-place, this leaks data from parent to child.
> + ANON_THP_COLLAPSE_FULLY_SHARED,
This checks case 1: we see all !PG_anon_exclusive, we aught to set the flag on
head page in parent, after fork(). Had we not, subsequent write will be allowed
to go through after wp fault and hit backing page -- which, since collapse was
in-place, is same as child, leaking data.
> + ANON_THP_COLLAPSE_LOWER_SHARED,
> + ANON_THP_COLLAPSE_UPPER_SHARED,
IIUC, this check only partially tests case 3. Had we introduced a bug where we
set PG_anon_exclusive on the head in this mixed-case, it's very similar to
ANON_THP_COLLAPSE_FULLY_SHARED.
However, if we decided to still attempt in-place collapse, and cleared the bit
in the parent, then the write here will be CoW'd and we won't see data leak
into the child. In order for problems to occur, we'd need something to trip
the improper CoW. The example you've shared with me was a io_uring fixed buffer
mapping the non-shared pages, which, after CoW, would disagree.
That said, I'm not sure the extra work required to catch this case is worth the
effort.
> +};
> +
> +static void do_test_anon_thp_collapse(char *mem, size_t size,
> + enum anon_thp_collapse_test test)
> +{
> + struct comm_pipes comm_pipes;
> + char buf;
> + int ret;
> +
> + ret = setup_comm_pipes(&comm_pipes);
> + if (ret) {
> + ksft_test_result_fail("pipe() failed\n");
> + return;
> + }
> +
> + /*
> + * Trigger PTE-mapping the THP by temporarily mapping a single subpage
> + * R/O, such that we can try collapsing it later.
> + */
> + ret = mprotect(mem + pagesize, pagesize, PROT_READ);
> + if (ret) {
> + ksft_test_result_fail("mprotect() failed\n");
> + goto close_comm_pipes;
> + }
> + ret = mprotect(mem + pagesize, pagesize, PROT_READ | PROT_WRITE);
> + if (ret) {
> + ksft_test_result_fail("mprotect() failed\n");
> + goto close_comm_pipes;
> + }
Might be a good place for a check_huge_anon(mem, 0, ..) to validate pte-mapped.
> + switch (test) {
> + case ANON_THP_COLLAPSE_UNSHARED:
> + /* Collapse before actually COW-sharing the page. */
> + ret = madvise(mem, size, MADV_COLLAPSE);
> + if (ret) {
> + ksft_test_result_skip("MADV_COLLAPSE failed: %s\n",
> + strerror(errno));
> + goto close_comm_pipes;
> + }
> + break;
> + case ANON_THP_COLLAPSE_FULLY_SHARED:
> + /* COW-share the full PTE-mapped THP. */
> + break;
> + case ANON_THP_COLLAPSE_LOWER_SHARED:
> + /* Don't COW-share the upper part of the THP. */
> + ret = madvise(mem + size / 2, size / 2, MADV_DONTFORK);
> + if (ret) {
> + ksft_test_result_fail("MADV_DONTFORK failed\n");
> + goto close_comm_pipes;
> + }
> + break;
> + case ANON_THP_COLLAPSE_UPPER_SHARED:
> + /* Don't COW-share the lower part of the THP. */
> + ret = madvise(mem, size / 2, MADV_DONTFORK);
> + if (ret) {
> + ksft_test_result_fail("MADV_DONTFORK failed\n");
> + goto close_comm_pipes;
> + }
> + break;
> + default:
> + assert(false);
> + }
> +
> + ret = fork();
> + if (ret < 0) {
> + ksft_test_result_fail("fork() failed\n");
> + goto close_comm_pipes;
> + } else if (!ret) {
> + switch (test) {
> + case ANON_THP_COLLAPSE_UNSHARED:
> + case ANON_THP_COLLAPSE_FULLY_SHARED:
> + exit(child_memcmp_fn(mem, size, &comm_pipes));
> + break;
> + case ANON_THP_COLLAPSE_LOWER_SHARED:
> + exit(child_memcmp_fn(mem, size / 2, &comm_pipes));
> + break;
> + case ANON_THP_COLLAPSE_UPPER_SHARED:
> + exit(child_memcmp_fn(mem + size / 2, size / 2,
> + &comm_pipes));
> + break;
> + default:
> + assert(false);
> + }
> + }
> +
> + while (read(comm_pipes.child_ready[0], &buf, 1) != 1)
> + ;
> +
> + switch (test) {
> + case ANON_THP_COLLAPSE_UNSHARED:
> + break;
> + case ANON_THP_COLLAPSE_UPPER_SHARED:
> + case ANON_THP_COLLAPSE_LOWER_SHARED:
> + /*
> + * Revert MADV_DONTFORK such that we merge the VMAs and are
> + * able to actually collapse.
> + */
> + ret = madvise(mem, size, MADV_DOFORK);
> + if (ret) {
> + ksft_test_result_fail("MADV_DOFORK failed\n");
> + write(comm_pipes.parent_ready[1], "0", 1);
> + wait(&ret);
> + goto close_comm_pipes;
> + }
> + /* FALLTHROUGH */
> + case ANON_THP_COLLAPSE_FULLY_SHARED:
> + /* Collapse before anyone modified the COW-shared page. */
> + ret = madvise(mem, size, MADV_COLLAPSE);
> + if (ret) {
> + ksft_test_result_skip("MADV_COLLAPSE failed: %s\n",
> + strerror(errno));
> + write(comm_pipes.parent_ready[1], "0", 1);
> + wait(&ret);
> + goto close_comm_pipes;
> + }
> + break;
> + default:
> + assert(false);
> + }
> +
> + /* Modify the page. */
> + memset(mem, 0xff, size);
> + write(comm_pipes.parent_ready[1], "0", 1);
> +
> + wait(&ret);
> + if (WIFEXITED(ret))
> + ret = WEXITSTATUS(ret);
> + else
> + ret = -EINVAL;
> +
> + ksft_test_result(!ret, "No leak from parent into child\n");
> +close_comm_pipes:
> + close_comm_pipes(&comm_pipes);
> +}
> +
> +static void test_anon_thp_collapse_unshared(char *mem, size_t size)
> +{
> + do_test_anon_thp_collapse(mem, size, ANON_THP_COLLAPSE_UNSHARED);
> +}
> +
> +static void test_anon_thp_collapse_fully_shared(char *mem, size_t size)
> +{
> + do_test_anon_thp_collapse(mem, size, ANON_THP_COLLAPSE_FULLY_SHARED);
> +}
> +
> +static void test_anon_thp_collapse_lower_shared(char *mem, size_t size)
> +{
> + do_test_anon_thp_collapse(mem, size, ANON_THP_COLLAPSE_LOWER_SHARED);
> +}
> +
> +static void test_anon_thp_collapse_upper_shared(char *mem, size_t size)
> +{
> + do_test_anon_thp_collapse(mem, size, ANON_THP_COLLAPSE_UPPER_SHARED);
> +}
> +
> +/*
> + * Test cases that are specific to anonymous THP: pages in private mappings
> + * that may get shared via COW during fork().
> + */
> +static const struct test_case anon_thp_test_cases[] = {
> + /*
> + * Basic COW test for fork() without any GUP when collapsing a THP
> + * before fork().
> + *
> + * Re-mapping a PTE-mapped anon THP using a single PMD ("in-place
> + * collapse") might easily get COW handling wrong when not collapsing
> + * exclusivity information properly.
> + */
> + {
> + "Basic COW after fork() when collapsing before fork()",
> + test_anon_thp_collapse_unshared,
> + },
> + /* Basic COW test, but collapse after COW-sharing a full THP. */
> + {
> + "Basic COW after fork() when collapsing after fork() (fully shared)",
> + test_anon_thp_collapse_fully_shared,
> + },
> + /*
> + * Basic COW test, but collapse after COW-sharing the lower half of a
> + * THP.
> + */
> + {
> + "Basic COW after fork() when collapsing after fork() (lower shared)",
> + test_anon_thp_collapse_lower_shared,
> + },
> + /*
> + * Basic COW test, but collapse after COW-sharing the upper half of a
> + * THP.
> + */
> + {
> + "Basic COW after fork() when collapsing after fork() (upper shared)",
> + test_anon_thp_collapse_upper_shared,
> + },
> +};
> +
> +static void run_anon_thp_test_cases(void)
> +{
> + int i;
> +
> + if (!thpsize)
> + return;
> +
> + ksft_print_msg("[INFO] Anonymous THP tests\n");
> +
> + for (i = 0; i < ARRAY_SIZE(anon_thp_test_cases); i++) {
> + struct test_case const *test_case = &anon_thp_test_cases[i];
> +
> + ksft_print_msg("[RUN] %s\n", test_case->desc);
> + do_run_with_thp(test_case->fn, THP_RUN_PMD);
> + }
> +}
> +
> +static int tests_per_anon_thp_test_case(void)
> +{
> + return thpsize ? 1 : 0;
> +}
> +
> typedef void (*non_anon_test_fn)(char *mem, const char *smem, size_t size);
>
> static void test_cow(char *mem, const char *smem, size_t size)
> @@ -1518,6 +1744,7 @@ int main(int argc, char **argv)
>
> ksft_print_header();
> ksft_set_plan(ARRAY_SIZE(anon_test_cases) * tests_per_anon_test_case() +
> + ARRAY_SIZE(anon_thp_test_cases) * tests_per_anon_thp_test_case() +
> ARRAY_SIZE(non_anon_test_cases) * tests_per_non_anon_test_case());
>
> gup_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
> @@ -1526,6 +1753,7 @@ int main(int argc, char **argv)
> ksft_exit_fail_msg("opening pagemap failed\n");
>
> run_anon_test_cases();
> + run_anon_thp_test_cases();
> run_non_anon_test_cases();
>
> err = ksft_get_fail_cnt();
> --
> 2.39.0
>
Overall the tests look good, though too late to record that in the log. At least
mail archives will have it.
On 16.03.23 00:28, Zach O'Keefe wrote:
>>
>> Currently, anonymous PTE-mapped THPs cannot be collapsed in-place:
>> collapsing (e.g., via MADV_COLLAPSE) implies allocating a fresh THP and
>> mapping that new THP via a PMD: as it's a fresh anon THP, it will get the
>> exclusive flag set on the head page and everybody is happy.
>>
>> However, if the kernel would ever support in-place collapse of anonymous
>> THPs (replacing a page table mapping each sub-page of a THP via PTEs with a
>> single PMD mapping the complete THP), exclusivity information stored for
>> each sub-page would have to be collapsed accordingly:
>>
>> (1) All PTEs map !exclusive anon sub-pages: the in-place collapsed THP
>> must not not have the exclusive flag set on the head page mapped by
>> the PMD. This is the easiest case to handle ("simply don't set any
>> exclusive flags").
>>
>> (2) All PTEs map exclusive anon sub-pages: when collapsing, we have to
>> clear the exclusive flag from all tail pages and only leave the
>> exclusive flag set for the head page. Otherwise, fork() after
>> collapse would not clear the exclusive flags from the tail pages
>> and we'd be in trouble once PTE-mapping the shared THP when writing
>> to shared tail pages that still have the exclusive flag set. This
>> would effectively revert what the PTE-mapping code does when
>> propagating the exclusive flag to all sub-pages.
>>
>> (3) PTEs map a mixture of exclusive and !exclusive anon sub-pages (can
>> happen e.g., due to MADV_DONTFORK before fork()). We must not
>> collapse the THP in-place, otherwise bad things may happen:
>> the exclusive flags of sub-pages would get ignored and the
>> exclusive flag of the head page would get used instead.
>>
>> Now that we have MADV_COLLAPSE in place to trigger collapsing a THP,
>> let's add some test cases that would bail out early, if we'd
>> voluntarily/accidantially unlock in-place collapse for anon THPs and
>> forget about taking proper care of exclusive flags.
>
> Hey David,
>
> Sorry for the (very) delayed review. As our helpful syncs offline, I'm in a
> better place to understand the intent of these tests.
Thanks for the review and sorry for the late reply!
>
> On Wed, Jan 4, 2023 at 6:49 AM David Hildenbrand <[email protected]> wrote:
>
>> Running the test on a kernel with MADV_COLLAPSE support:
>> # [INFO] Anonymous THP tests
>> # [RUN] Basic COW after fork() when collapsing before fork()
>> ok 169 No leak from parent into child
>> # [RUN] Basic COW after fork() when collapsing after fork() (fully shared)
>> ok 170 # SKIP MADV_COLLAPSE failed: Invalid argument
>> # [RUN] Basic COW after fork() when collapsing after fork() (lower shared)
>> ok 171 No leak from parent into child
>> # [RUN] Basic COW after fork() when collapsing after fork() (upper shared)
>> ok 172 No leak from parent into child
>>
>> For now, MADV_COLLAPSE always seems to fail if all PTEs map shared
>> sub-pages.
>
> Thanks for pointing this out. I have had a TODO / pending patch to verify this
> for awhile. It seems this is due to some old requirement of requiring a single
> writeable pte. I don't know this history well here, but I don't think it's
Interesting. A single writable PTE would indicate that at least some
part of the range is exclusive (!shared) without looking at mapcounts.
But of course, it's an unreliable check.
> required anymore, and, as this test shows, prevents collapse of
> pte-mapped-hugepage shared across fork().
[...]
>>
>> +#ifndef MADV_COLLAPSE
>> +#define MADV_COLLAPSE 25
>> +#endif
>> +
>> static size_t pagesize;
>> static int pagemap_fd;
>> static size_t thpsize;
>> @@ -1178,6 +1182,228 @@ static int tests_per_anon_test_case(void)
>> return tests;
>> }
>>
>> +enum anon_thp_collapse_test {
>> + ANON_THP_COLLAPSE_UNSHARED,
>
> OK, so this test checks case 2: we see all PG_anon_exclusive, and need to make
> sure we clear the bit on tails. Had we not, after fork(), the bit would still be
> set on tails (since copy_huge_pmd() -> page_try_dup_anon_rmap() only clears it
> on head), and so write to said tails would go through after wp fault, and since
> collapse was in-place, this leaks data from parent to child.
>
>> + ANON_THP_COLLAPSE_FULLY_SHARED,
>
> This checks case 1: we see all !PG_anon_exclusive, we aught to set the flag on
> head page in parent, after fork(). Had we not, subsequent write will be allowed
> to go through after wp fault and hit backing page -- which, since collapse was
> in-place, is same as child, leaking data.
>
>> + ANON_THP_COLLAPSE_LOWER_SHARED,
>> + ANON_THP_COLLAPSE_UPPER_SHARED,
>
> IIUC, this check only partially tests case 3. Had we introduced a bug where we
> set PG_anon_exclusive on the head in this mixed-case, it's very similar to
> ANON_THP_COLLAPSE_FULLY_SHARED.
>
> However, if we decided to still attempt in-place collapse, and cleared the bit
> in the parent, then the write here will be CoW'd and we won't see data leak
> into the child. In order for problems to occur, we'd need something to trip
> the improper CoW. The example you've shared with me was a io_uring fixed buffer
> mapping the non-shared pages, which, after CoW, would disagree.
>
> That said, I'm not sure the extra work required to catch this case is worth the
> effort.
Yes, just like most tests, covering all corner cases takes a lot of
effort and is probably not worth it. Wiring up io_uring (also used in
the file already for these purposes) would be possible, however the
initial tests are really more targeted at "bail out early" and to catch
the obvious things one might miss when collapsing THP.
IOW: when people ignore exclusivity information when collapsing; the
failing tests would directly tell you what needs to be done: better not
mess with case #3. So IMHO, the basic tests for #3 are sufficient to
express "see, you did it wrong: better be careful".
There are scenarios where we could handle #3: for example, if we're the
only one mapping all sub-pages of a THP and there are no other
references (i.e., page_count() == 512), we could collapse and mark the
head page exclusive (clearing the marker from any tail pages). Once we
really want to optimize these cases, we can add more such tests.
--
Thanks,
David / dhildenb