2017-04-10 09:48:29

by Mel Gorman

Subject: [PATCH] mm, numa: Fix bad pmd by atomically checking for pmd_trans_huge when marking page tables prot_numa

A user reported a bug against a distribution kernel while running
a proprietary workload described as "memory intensive that is not
swapping" that is expected to apply to mainline kernels. The workload
is read/write/modifying ranges of memory and checking the contents. They
reported that within a few hours a bad PMD would be reported, followed
by memory corruption where expected data was all zeros. A partial report
of the bad PMD looked like

[ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
[ 5195.341184] ------------[ cut here ]------------
[ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
....
[ 5195.410033] Call Trace:
[ 5195.410471] [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
[ 5195.410716] [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
[ 5195.410918] [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
[ 5195.411200] [<ffffffff81098322>] task_work_run+0x72/0x90
[ 5195.411246] [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
[ 5195.411494] [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
[ 5195.411739] [<ffffffff815e56af>] retint_user+0x8/0x10

Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
was a false detection. The bug does not trigger if automatic NUMA balancing
or transparent huge pages are disabled.

The bug is due to a race in change_pmd_range between a pmd_trans_huge and
pmd_none_or_clear_bad check without any locks held. During the pmd_trans_huge
check, a parallel protection update under lock can have cleared the PMD
and filled it with a prot_numa entry between the transhuge check and the
pmd_none_or_clear_bad check.

While this could be fixed with heavy locking, it's only necessary to
make a copy of the PMD on the stack during change_pmd_range and avoid
races. A new helper is created for this as the check is quite subtle and the
existing similar helpful is not suitable. This passed 154 hours of testing
(usually triggers between 20 minutes and 24 hours) without detecting bad
PMDs or corruption. A basic test of an autonuma-intensive workload showed
no significant change in behaviour.

Signed-off-by: Mel Gorman <[email protected]>
Cc: [email protected]
---
include/asm-generic/pgtable.h | 25 +++++++++++++++++++++++++
mm/mprotect.c | 12 ++++++++++--
2 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 1fad160f35de..597fa482cd4a 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -819,6 +819,31 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
}

/*
+ * Used when setting automatic NUMA hinting protection where it is
+ * critical that a numa hinting PMD is not confused with a bad PMD.
+ */
+static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
+{
+ pmd_t pmdval = pmd_read_atomic(pmd);
+
+ /* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ barrier();
+#endif
+
+ if (pmd_none(pmdval))
+ return 1;
+ if (pmd_trans_huge(pmdval))
+ return 0;
+ if (unlikely(pmd_bad(pmdval))) {
+ pmd_clear_bad(pmd);
+ return 1;
+ }
+ return 0;
+}
+
+
+/*
* This is a noop if Transparent Hugepage Support is not built into
* the kernel. Otherwise it is equivalent to
* pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 8edd0d576254..821ff2904cdb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -150,8 +150,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
unsigned long this_pages;

next = pmd_addr_end(addr, end);
- if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
- && pmd_none_or_clear_bad(pmd))
+
+ /*
+ * Automatic NUMA balancing walks the tables with mmap_sem
+ * held for read. It's possible for a parallel update
+ * to occur between pmd_trans_huge and a pmd_none_or_clear_bad
+ * check leading to a false positive and clearing. Hence, it's
+ * necessary to atomically read the PMD value for all the
+ * checks.
+ */
+ if (!pmd_devmap(*pmd) && pmd_none_or_clear_bad_unless_trans_huge(pmd))
continue;

/* invoke the mmu notifier if the pmd is populated */


2017-04-10 10:03:24

by Vlastimil Babka

Subject: Re: [PATCH] mm, numa: Fix bad pmd by atomically checking for pmd_trans_huge when marking page tables prot_numa

On 04/10/2017 11:48 AM, Mel Gorman wrote:
> A user reported a bug against a distribution kernel while running
> a proprietary workload described as "memory intensive that is not
> swapping" that is expected to apply to mainline kernels. The workload
> is read/write/modifying ranges of memory and checking the contents. They
> reported that within a few hours a bad PMD would be reported, followed
> by memory corruption where expected data was all zeros. A partial report
> of the bad PMD looked like
>
> [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
> [ 5195.341184] ------------[ cut here ]------------
> [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
> ....
> [ 5195.410033] Call Trace:
> [ 5195.410471] [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
> [ 5195.410716] [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
> [ 5195.410918] [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
> [ 5195.411200] [<ffffffff81098322>] task_work_run+0x72/0x90
> [ 5195.411246] [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
> [ 5195.411494] [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
> [ 5195.411739] [<ffffffff815e56af>] retint_user+0x8/0x10
>
> Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
> was a false detection. The bug does not trigger if automatic NUMA balancing
> or transparent huge pages are disabled.
>
> The bug is due to a race in change_pmd_range between a pmd_trans_huge and
> pmd_none_or_clear_bad check without any locks held. During the pmd_trans_huge
> check, a parallel protection update under lock can have cleared the PMD
> and filled it with a prot_numa entry between the transhuge check and the
> pmd_none_or_clear_bad check.
>
> While this could be fixed with heavy locking, it's only necessary to
> make a copy of the PMD on the stack during change_pmd_range and avoid
> races. A new helper is created for this as the check is quite subtle and the
> existing similar helpful is not suitable. This passed 154 hours of testing
> (usually triggers between 20 minutes and 24 hours) without detecting bad
> PMDs or corruption. A basic test of an autonuma-intensive workload showed
> no significant change in behaviour.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Cc: [email protected]

It would be better if there was a Fixes: tag, or at least a version hint.
Assuming it's since autonuma balancing was merged?

Acked-by: Vlastimil Babka <[email protected]>

> ---
> include/asm-generic/pgtable.h | 25 +++++++++++++++++++++++++
> mm/mprotect.c | 12 ++++++++++--
> 2 files changed, 35 insertions(+), 2 deletions(-)
>
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 1fad160f35de..597fa482cd4a 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -819,6 +819,31 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
> }
>
> /*
> + * Used when setting automatic NUMA hinting protection where it is
> + * critical that a numa hinting PMD is not confused with a bad PMD.
> + */
> +static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
> +{
> + pmd_t pmdval = pmd_read_atomic(pmd);
> +
> + /* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + barrier();
> +#endif
> +
> + if (pmd_none(pmdval))
> + return 1;
> + if (pmd_trans_huge(pmdval))
> + return 0;
> + if (unlikely(pmd_bad(pmdval))) {
> + pmd_clear_bad(pmd);
> + return 1;
> + }
> + return 0;
> +}
> +
> +
> +/*
> * This is a noop if Transparent Hugepage Support is not built into
> * the kernel. Otherwise it is equivalent to
> * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 8edd0d576254..821ff2904cdb 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -150,8 +150,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> unsigned long this_pages;
>
> next = pmd_addr_end(addr, end);
> - if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
> - && pmd_none_or_clear_bad(pmd))
> +
> + /*
> + * Automatic NUMA balancing walks the tables with mmap_sem
> + * held for read. It's possible for a parallel update
> + * to occur between pmd_trans_huge and a pmd_none_or_clear_bad
> + * check leading to a false positive and clearing. Hence, it's
> + * necessary to atomically read the PMD value for all the
> + * checks.
> + */
> + if (!pmd_devmap(*pmd) && pmd_none_or_clear_bad_unless_trans_huge(pmd))
> continue;
>
> /* invoke the mmu notifier if the pmd is populated */
>

2017-04-10 12:19:35

by Mel Gorman

Subject: Re: [PATCH] mm, numa: Fix bad pmd by atomically checking for pmd_trans_huge when marking page tables prot_numa

On Mon, Apr 10, 2017 at 12:03:20PM +0200, Vlastimil Babka wrote:
> On 04/10/2017 11:48 AM, Mel Gorman wrote:
> > A user reported a bug against a distribution kernel while running
> > a proprietary workload described as "memory intensive that is not
> > swapping" that is expected to apply to mainline kernels. The workload
> > is read/write/modifying ranges of memory and checking the contents. They
> > reported that within a few hours a bad PMD would be reported, followed
> > by memory corruption where expected data was all zeros. A partial report
> > of the bad PMD looked like
> >
> > [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
> > [ 5195.341184] ------------[ cut here ]------------
> > [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
> > ....
> > [ 5195.410033] Call Trace:
> > [ 5195.410471] [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
> > [ 5195.410716] [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
> > [ 5195.410918] [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
> > [ 5195.411200] [<ffffffff81098322>] task_work_run+0x72/0x90
> > [ 5195.411246] [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
> > [ 5195.411494] [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
> > [ 5195.411739] [<ffffffff815e56af>] retint_user+0x8/0x10
> >
> > Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
> > was a false detection. The bug does not trigger if automatic NUMA balancing
> > or transparent huge pages are disabled.
> >
> > The bug is due to a race in change_pmd_range between a pmd_trans_huge and
> > pmd_none_or_clear_bad check without any locks held. During the pmd_trans_huge
> > check, a parallel protection update under lock can have cleared the PMD
> > and filled it with a prot_numa entry between the transhuge check and the
> > pmd_none_or_clear_bad check.
> >
> > While this could be fixed with heavy locking, it's only necessary to
> > make a copy of the PMD on the stack during change_pmd_range and avoid
> > races. A new helper is created for this as the check is quite subtle and the
> > existing similar helpful is not suitable. This passed 154 hours of testing
> > (usually triggers between 20 minutes and 24 hours) without detecting bad
> > PMDs or corruption. A basic test of an autonuma-intensive workload showed
> > no significant change in behaviour.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > Cc: [email protected]
>
> It would be better if there was a Fixes: tag, or at least a version hint. Assuming
> it's since autonuma balancing was merged?
>

Fair point. It's all the way back to 3.15 rather than all the way back to
the introduction of automatic NUMA balancing so

Cc: [email protected] # 3.15+

--
Mel Gorman
SUSE Labs

2017-04-10 12:38:28

by Rik van Riel

Subject: Re: [PATCH] mm, numa: Fix bad pmd by atomically checking for pmd_trans_huge when marking page tables prot_numa

On Mon, 2017-04-10 at 10:48 +0100, Mel Gorman wrote:
>
> While this could be fixed with heavy locking, it's only necessary to
> make a copy of the PMD on the stack during change_pmd_range and avoid
> races. A new helper is created for this as the check is quite subtle
> and the
> existing similar helpful is not suitable. This passed 154 hours of
> testing
> (usually triggers between 20 minutes and 24 hours) without detecting
> bad
> PMDs or corruption. A basic test of an autonuma-intensive workload
> showed
> no significant change in behaviour.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Cc: [email protected]

Acked-by: Rik van Riel <[email protected]>

2017-04-10 13:53:47

by Michal Hocko

Subject: Re: [PATCH] mm, numa: Fix bad pmd by atomically checking for pmd_trans_huge when marking page tables prot_numa

On Mon 10-04-17 10:48:25, Mel Gorman wrote:
> A user reported a bug against a distribution kernel while running
> a proprietary workload described as "memory intensive that is not
> swapping" that is expected to apply to mainline kernels. The workload
> is read/write/modifying ranges of memory and checking the contents. They
> reported that within a few hours a bad PMD would be reported, followed
> by memory corruption where expected data was all zeros. A partial report
> of the bad PMD looked like
>
> [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
> [ 5195.341184] ------------[ cut here ]------------
> [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
> ....
> [ 5195.410033] Call Trace:
> [ 5195.410471] [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
> [ 5195.410716] [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
> [ 5195.410918] [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
> [ 5195.411200] [<ffffffff81098322>] task_work_run+0x72/0x90
> [ 5195.411246] [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
> [ 5195.411494] [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
> [ 5195.411739] [<ffffffff815e56af>] retint_user+0x8/0x10
>
> Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
> was a false detection. The bug does not trigger if automatic NUMA balancing
> or transparent huge pages are disabled.
>
> The bug is due to a race in change_pmd_range between a pmd_trans_huge and
> pmd_none_or_clear_bad check without any locks held. During the pmd_trans_huge
> check, a parallel protection update under lock can have cleared the PMD
> and filled it with a prot_numa entry between the transhuge check and the
> pmd_none_or_clear_bad check.
>
> While this could be fixed with heavy locking, it's only necessary to
> make a copy of the PMD on the stack during change_pmd_range and avoid
> races. A new helper is created for this as the check is quite subtle and the
> existing similar helpful is not suitable. This passed 154 hours of testing

s@helpful@helper@ I suspect

> (usually triggers between 20 minutes and 24 hours) without detecting bad
> PMDs or corruption. A basic test of an autonuma-intensive workload showed
> no significant change in behaviour.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Cc: [email protected]

Acked-by: Michal Hocko <[email protected]>

you will probably win the_longest_function_name_contest but I do not
have a much better suggestion.

> ---
> include/asm-generic/pgtable.h | 25 +++++++++++++++++++++++++
> mm/mprotect.c | 12 ++++++++++--
> 2 files changed, 35 insertions(+), 2 deletions(-)
>
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 1fad160f35de..597fa482cd4a 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -819,6 +819,31 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
> }
>
> /*
> + * Used when setting automatic NUMA hinting protection where it is
> + * critical that a numa hinting PMD is not confused with a bad PMD.
> + */
> +static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
> +{
> + pmd_t pmdval = pmd_read_atomic(pmd);
> +
> + /* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + barrier();
> +#endif
> +
> + if (pmd_none(pmdval))
> + return 1;
> + if (pmd_trans_huge(pmdval))
> + return 0;
> + if (unlikely(pmd_bad(pmdval))) {
> + pmd_clear_bad(pmd);
> + return 1;
> + }
> + return 0;
> +}
> +
> +
> +/*
> * This is a noop if Transparent Hugepage Support is not built into
> * the kernel. Otherwise it is equivalent to
> * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 8edd0d576254..821ff2904cdb 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -150,8 +150,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> unsigned long this_pages;
>
> next = pmd_addr_end(addr, end);
> - if (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
> - && pmd_none_or_clear_bad(pmd))
> +
> + /*
> + * Automatic NUMA balancing walks the tables with mmap_sem
> + * held for read. It's possible for a parallel update
> + * to occur between pmd_trans_huge and a pmd_none_or_clear_bad
> + * check leading to a false positive and clearing. Hence, it's
> + * necessary to atomically read the PMD value for all the
> + * checks.
> + */
> + if (!pmd_devmap(*pmd) && pmd_none_or_clear_bad_unless_trans_huge(pmd))
> continue;
>
> /* invoke the mmu notifier if the pmd is populated */

--
Michal Hocko
SUSE Labs

2017-04-10 17:21:01

by Mel Gorman

Subject: Re: [PATCH] mm, numa: Fix bad pmd by atomically checking for pmd_trans_huge when marking page tables prot_numa

On Mon, Apr 10, 2017 at 11:45:08AM -0500, Zi Yan wrote:
> > While this could be fixed with heavy locking, it's only necessary to
> > make a copy of the PMD on the stack during change_pmd_range and avoid
> > races. A new helper is created for this as the check is quite subtle and the
> > existing similar helpful is not suitable. This passed 154 hours of testing
> > (usually triggers between 20 minutes and 24 hours) without detecting bad
> > PMDs or corruption. A basic test of an autonuma-intensive workload showed
> > no significant change in behaviour.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > Cc: [email protected]
>
> Does this patch fix the same problem fixed by Kirill's patch here?
> https://lkml.org/lkml/2017/3/2/347
>

I don't think so. The race I'm concerned with is due to locks not being
held and is in a different path.

--
Mel Gorman
SUSE Labs

2017-04-10 17:39:11

by Mel Gorman

Subject: Re: [PATCH] mm, numa: Fix bad pmd by atomically checking for pmd_trans_huge when marking page tables prot_numa

On Mon, Apr 10, 2017 at 03:53:42PM +0200, Michal Hocko wrote:
> > While this could be fixed with heavy locking, it's only necessary to
> > make a copy of the PMD on the stack during change_pmd_range and avoid
> > races. A new helper is created for this as the check is quite subtle and the
> > existing similar helpful is not suitable. This passed 154 hours of testing
>
> s@helpful@helper@ I suspect
>

Yes. I'll wait to see if there is more feedback and if not, resend unless
Andrew decides to pick it up and correct the mistake directly.

> > (usually triggers between 20 minutes and 24 hours) without detecting bad
> > PMDs or corruption. A basic test of an autonuma-intensive workload showed
> > no significant change in behaviour.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > Cc: [email protected]
>
> Acked-by: Michal Hocko <[email protected]>
>

Thanks.

> you will probably win the_longest_function_name_contest but I do not
> have a much better suggestion.
>

I know, it's not a type of function that yields a snappy name.

--
Mel Gorman
SUSE Labs

2017-04-10 17:45:16

by Zi Yan

Subject: Re: [PATCH] mm, numa: Fix bad pmd by atomically checking for pmd_trans_huge when marking page tables prot_numa

On 10 Apr 2017, at 4:48, Mel Gorman wrote:

> A user reported a bug against a distribution kernel while running
> a proprietary workload described as "memory intensive that is not
> swapping" that is expected to apply to mainline kernels. The workload
> is read/write/modifying ranges of memory and checking the contents. They
> reported that within a few hours a bad PMD would be reported, followed
> by memory corruption where expected data was all zeros. A partial report
> of the bad PMD looked like
>
> [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
> [ 5195.341184] ------------[ cut here ]------------
> [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
> ....
> [ 5195.410033] Call Trace:
> [ 5195.410471] [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
> [ 5195.410716] [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
> [ 5195.410918] [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
> [ 5195.411200] [<ffffffff81098322>] task_work_run+0x72/0x90
> [ 5195.411246] [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
> [ 5195.411494] [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
> [ 5195.411739] [<ffffffff815e56af>] retint_user+0x8/0x10
>
> Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
> was a false detection. The bug does not trigger if automatic NUMA balancing
> or transparent huge pages are disabled.
>
> The bug is due to a race in change_pmd_range between a pmd_trans_huge and
> pmd_none_or_clear_bad check without any locks held. During the pmd_trans_huge
> check, a parallel protection update under lock can have cleared the PMD
> and filled it with a prot_numa entry between the transhuge check and the
> pmd_none_or_clear_bad check.
>
> While this could be fixed with heavy locking, it's only necessary to
> make a copy of the PMD on the stack during change_pmd_range and avoid
> races. A new helper is created for this as the check is quite subtle and the
> existing similar helpful is not suitable. This passed 154 hours of testing
> (usually triggers between 20 minutes and 24 hours) without detecting bad
> PMDs or corruption. A basic test of an autonuma-intensive workload showed
> no significant change in behaviour.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Cc: [email protected]

Does this patch fix the same problem fixed by Kirill's patch here?
https://lkml.org/lkml/2017/3/2/347

--
Best Regards
Yan Zi



2017-04-10 17:49:50

by Zi Yan

Subject: Re: [PATCH] mm, numa: Fix bad pmd by atomically checking for pmd_trans_huge when marking page tables prot_numa

On 10 Apr 2017, at 12:20, Mel Gorman wrote:

> On Mon, Apr 10, 2017 at 11:45:08AM -0500, Zi Yan wrote:
>>> While this could be fixed with heavy locking, it's only necessary to
>>> make a copy of the PMD on the stack during change_pmd_range and avoid
>>> races. A new helper is created for this as the check is quite subtle and the
>>> existing similar helpful is not suitable. This passed 154 hours of testing
>>> (usually triggers between 20 minutes and 24 hours) without detecting bad
>>> PMDs or corruption. A basic test of an autonuma-intensive workload showed
>>> no significant change in behaviour.
>>>
>>> Signed-off-by: Mel Gorman <[email protected]>
>>> Cc: [email protected]
>>
>> Does this patch fix the same problem fixed by Kirill's patch here?
>> https://lkml.org/lkml/2017/3/2/347
>>
>
> I don't think so. The race I'm concerned with is due to locks not being
> held and is in a different path.

I do not agree. Kirill's patch is fixing the same race problem but in
zap_pmd_range().

The original autoNUMA code first clears the PMD, then sets it to a protnone
entry. pmd_trans_huge() does not return TRUE because it saw the cleared PMD,
but pmd_none_or_clear_bad() later saw the protnone entry and reported it as
bad. Is this the problem you are trying to solve?

Kirill's patch will pmdp_invalidate() the PMD entry, which keeps the
_PAGE_PSE bit, so pmd_trans_huge() will return TRUE. In this case, it also
fixes your race problem in change_pmd_range().

Let me know if I miss anything.

Thanks.

--
Best Regards
Yan Zi



2017-04-10 18:07:17

by Mel Gorman

Subject: Re: [PATCH] mm, numa: Fix bad pmd by atomically checking for pmd_trans_huge when marking page tables prot_numa

On Mon, Apr 10, 2017 at 12:49:40PM -0500, Zi Yan wrote:
> On 10 Apr 2017, at 12:20, Mel Gorman wrote:
>
> > On Mon, Apr 10, 2017 at 11:45:08AM -0500, Zi Yan wrote:
> >>> While this could be fixed with heavy locking, it's only necessary to
> >>> make a copy of the PMD on the stack during change_pmd_range and avoid
> >>> races. A new helper is created for this as the check is quite subtle and the
> >>> existing similar helpful is not suitable. This passed 154 hours of testing
> >>> (usually triggers between 20 minutes and 24 hours) without detecting bad
> >>> PMDs or corruption. A basic test of an autonuma-intensive workload showed
> >>> no significant change in behaviour.
> >>>
> >>> Signed-off-by: Mel Gorman <[email protected]>
> >>> Cc: [email protected]
> >>
> >> Does this patch fix the same problem fixed by Kirill's patch here?
> >> https://lkml.org/lkml/2017/3/2/347
> >>
> >
> > I don't think so. The race I'm concerned with is due to locks not being
> > held and is in a different path.
>
> I do not agree. Kirill's patch is fixing the same race problem but in
> zap_pmd_range().
>
> The original autoNUMA code first clears the PMD, then sets it to a protnone
> entry. pmd_trans_huge() does not return TRUE because it saw the cleared PMD,
> but pmd_none_or_clear_bad() later saw the protnone entry and reported it as
> bad. Is this the problem you are trying to solve?
>
> Kirill's patch will pmdp_invalidate() the PMD entry, which keeps the
> _PAGE_PSE bit, so pmd_trans_huge() will return TRUE. In this case, it also
> fixes your race problem in change_pmd_range().
>
> Let me know if I miss anything.
>

Ok, now I see. I think you're correct and I withdraw the patch.

--
Mel Gorman
SUSE Labs

2017-04-10 22:09:05

by Andrew Morton

Subject: Re: [PATCH] mm, numa: Fix bad pmd by atomically checking for pmd_trans_huge when marking page tables prot_numa

On Mon, 10 Apr 2017 19:07:14 +0100 Mel Gorman <[email protected]> wrote:

> On Mon, Apr 10, 2017 at 12:49:40PM -0500, Zi Yan wrote:
> > On 10 Apr 2017, at 12:20, Mel Gorman wrote:
> >
> > > On Mon, Apr 10, 2017 at 11:45:08AM -0500, Zi Yan wrote:
> > >>> While this could be fixed with heavy locking, it's only necessary to
> > >>> make a copy of the PMD on the stack during change_pmd_range and avoid
> > >>> races. A new helper is created for this as the check is quite subtle and the
> > >>> existing similar helpful is not suitable. This passed 154 hours of testing
> > >>> (usually triggers between 20 minutes and 24 hours) without detecting bad
> > >>> PMDs or corruption. A basic test of an autonuma-intensive workload showed
> > >>> no significant change in behaviour.
> > >>>
> > >>> Signed-off-by: Mel Gorman <[email protected]>
> > >>> Cc: [email protected]
> > >>
> > >> Does this patch fix the same problem fixed by Kirill's patch here?
> > >> https://lkml.org/lkml/2017/3/2/347
> > >>
> > >
> > > I don't think so. The race I'm concerned with is due to locks not being
> > > held and is in a different path.
> >
> > I do not agree. Kirill's patch is fixing the same race problem but in
> > zap_pmd_range().
> >
> > The original autoNUMA code first clears the PMD, then sets it to a protnone
> > entry. pmd_trans_huge() does not return TRUE because it saw the cleared PMD,
> > but pmd_none_or_clear_bad() later saw the protnone entry and reported it as
> > bad. Is this the problem you are trying to solve?
> >
> > Kirill's patch will pmdp_invalidate() the PMD entry, which keeps the
> > _PAGE_PSE bit, so pmd_trans_huge() will return TRUE. In this case, it also
> > fixes your race problem in change_pmd_range().
> >
> > Let me know if I miss anything.
> >
>
> Ok, now I see. I think you're correct and I withdraw the patch.

I have Kirill's

thp-reduce-indentation-level-in-change_huge_pmd.patch
thp-fix-madv_dontneed-vs-numa-balancing-race.patch
mm-drop-unused-pmdp_huge_get_and_clear_notify.patch
thp-fix-madv_dontneed-vs-madv_free-race.patch
thp-fix-madv_dontneed-vs-madv_free-race-fix.patch
thp-fix-madv_dontneed-vs-clear-soft-dirty-race.patch

scheduled for 4.12-rc1. It sounds like
thp-fix-madv_dontneed-vs-numa-balancing-race.patch and
thp-fix-madv_dontneed-vs-madv_free-race.patch need to be boosted to
4.11 and stable?

2017-04-10 22:28:22

by Zi Yan

Subject: Re: [PATCH] mm, numa: Fix bad pmd by atomically checking for pmd_trans_huge when marking page tables prot_numa

On 10 Apr 2017, at 17:09, Andrew Morton wrote:

> On Mon, 10 Apr 2017 19:07:14 +0100 Mel Gorman <[email protected]> wrote:
>
>> On Mon, Apr 10, 2017 at 12:49:40PM -0500, Zi Yan wrote:
>>> On 10 Apr 2017, at 12:20, Mel Gorman wrote:
>>>
>>>> On Mon, Apr 10, 2017 at 11:45:08AM -0500, Zi Yan wrote:
>>>>>> While this could be fixed with heavy locking, it's only necessary to
>>>>>> make a copy of the PMD on the stack during change_pmd_range and avoid
>>>>>> races. A new helper is created for this as the check is quite subtle and the
>>>>>> existing similar helpful is not suitable. This passed 154 hours of testing
>>>>>> (usually triggers between 20 minutes and 24 hours) without detecting bad
>>>>>> PMDs or corruption. A basic test of an autonuma-intensive workload showed
>>>>>> no significant change in behaviour.
>>>>>>
>>>>>> Signed-off-by: Mel Gorman <[email protected]>
>>>>>> Cc: [email protected]
>>>>>
>>>>> Does this patch fix the same problem fixed by Kirill's patch here?
>>>>> https://lkml.org/lkml/2017/3/2/347
>>>>>
>>>>
>>>> I don't think so. The race I'm concerned with is due to locks not being
>>>> held and is in a different path.
>>>
>>> I do not agree. Kirill's patch is fixing the same race problem but in
>>> zap_pmd_range().
>>>
>>> The original autoNUMA code first clears the PMD, then sets it to a protnone
>>> entry. pmd_trans_huge() does not return TRUE because it saw the cleared PMD,
>>> but pmd_none_or_clear_bad() later saw the protnone entry and reported it as
>>> bad. Is this the problem you are trying to solve?
>>>
>>> Kirill's patch will pmdp_invalidate() the PMD entry, which keeps the
>>> _PAGE_PSE bit, so pmd_trans_huge() will return TRUE. In this case, it also
>>> fixes your race problem in change_pmd_range().
>>>
>>> Let me know if I miss anything.
>>>
>>
>> Ok, now I see. I think you're correct and I withdraw the patch.
>
> I have Kirill's
>
> thp-reduce-indentation-level-in-change_huge_pmd.patch
> thp-fix-madv_dontneed-vs-numa-balancing-race.patch
> mm-drop-unused-pmdp_huge_get_and_clear_notify.patch
> thp-fix-madv_dontneed-vs-madv_free-race.patch
> thp-fix-madv_dontneed-vs-madv_free-race-fix.patch
> thp-fix-madv_dontneed-vs-clear-soft-dirty-race.patch
>
> scheduled for 4.12-rc1. It sounds like
> thp-fix-madv_dontneed-vs-numa-balancing-race.patch and
> thp-fix-madv_dontneed-vs-madv_free-race.patch need to be boosted to
> 4.11 and stable?

thp-fix-madv_dontneed-vs-numa-balancing-race.patch is the fix for the
numa balancing problem reported in this thread.

mm-drop-unused-pmdp_huge_get_and_clear_notify.patch,
thp-fix-madv_dontneed-vs-madv_free-race.patch,
thp-fix-madv_dontneed-vs-madv_free-race-fix.patch, and
thp-fix-madv_dontneed-vs-clear-soft-dirty-race.patch

are the fixes for other potential race problems similar to this one.

I think it is better to have all these patches applied.

--
Best Regards
Yan Zi



2017-04-11 06:35:06

by Vlastimil Babka

Subject: Re: [PATCH] mm, numa: Fix bad pmd by atomically checking for pmd_trans_huge when marking page tables prot_numa

On 04/11/2017 12:28 AM, Zi Yan wrote:
> On 10 Apr 2017, at 17:09, Andrew Morton wrote:
>
>> On Mon, 10 Apr 2017 19:07:14 +0100 Mel Gorman <[email protected]> wrote:
>>
>>> On Mon, Apr 10, 2017 at 12:49:40PM -0500, Zi Yan wrote:
>>>> On 10 Apr 2017, at 12:20, Mel Gorman wrote:
>>>>
>>>>> On Mon, Apr 10, 2017 at 11:45:08AM -0500, Zi Yan wrote:
>>>>>>> While this could be fixed with heavy locking, it's only necessary to
>>>>>>> make a copy of the PMD on the stack during change_pmd_range and avoid
>>>>>>> races. A new helper is created for this as the check is quite subtle and the
>>>>>>> existing similar helper is not suitable. This passed 154 hours of testing
>>>>>>> (usually triggers between 20 minutes and 24 hours) without detecting bad
>>>>>>> PMDs or corruption. A basic test of an autonuma-intensive workload showed
>>>>>>> no significant change in behaviour.
>>>>>>>
>>>>>>> Signed-off-by: Mel Gorman <[email protected]>
>>>>>>> Cc: [email protected]
>>>>>>
>>>>>> Does this patch fix the same problem fixed by Kirill's patch here?
>>>>>> https://lkml.org/lkml/2017/3/2/347
>>>>>>
>>>>>
>>>>> I don't think so. The race I'm concerned with is due to locks not being
>>>>> held and is in a different path.
>>>>
>>>> I do not agree. Kirill's patch is fixing the same race problem but in
>>>> zap_pmd_range().
>>>>
>>>> The original autoNUMA code first clears PMD then sets it to protnone entry.
>>>> pmd_trans_huge() does not return TRUE because it saw cleared PMD, but
>>>> pmd_none_or_clear_bad() later saw the protnone entry and reported it as bad.
>>>> Is this the problem you are trying to solve?
>>>>
>>>> Kirill's patch will pmdp_invalidate() the PMD entry, which keeps _PAGE_PSE bit,
>>>> so pmd_trans_huge() will return TRUE. In this case, it also fixes
>>>> your race problem in change_pmd_range().
>>>>
>>>> Let me know if I miss anything.
>>>>
>>>
>>> Ok, now I see. I think you're correct and I withdraw the patch.
>>
>> I have Kirill's
>>
>> thp-reduce-indentation-level-in-change_huge_pmd.patch
>> thp-fix-madv_dontneed-vs-numa-balancing-race.patch
>> mm-drop-unused-pmdp_huge_get_and_clear_notify.patch
>> thp-fix-madv_dontneed-vs-madv_free-race.patch
>> thp-fix-madv_dontneed-vs-madv_free-race-fix.patch
>> thp-fix-madv_dontneed-vs-clear-soft-dirty-race.patch
>>
>> scheduled for 4.12-rc1. It sounds like
>> thp-fix-madv_dontneed-vs-madv_free-race.patch and
>> thp-fix-madv_dontneed-vs-madv_free-race.patch need to be boosted to
>> 4.11 and stable?
>
> thp-fix-madv_dontneed-vs-numa-balancing-race.patch is the fix for
> numa balancing problem reported in this thread.
>
> mm-drop-unused-pmdp_huge_get_and_clear_notify.patch,
> thp-fix-madv_dontneed-vs-madv_free-race.patch,
> thp-fix-madv_dontneed-vs-madv_free-race-fix.patch, and
> thp-fix-madv_dontneed-vs-clear-soft-dirty-race.patch
>
> are the fixes for other potential race problems similar to this one.
>
> I think it is better to have all these patches applied.

Yeah we should get all such fixes to stable IMHO (after review :). It's
not the first time that a fix for MADV_DONTNEED turned out to also fix a
race that involved "normal operation" with THP, without such syscalls.

> --
> Best Regards
> Yan Zi
>

2017-04-11 08:29:14

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa

On Mon, Apr 10, 2017 at 03:09:03PM -0700, Andrew Morton wrote:
> On Mon, 10 Apr 2017 19:07:14 +0100 Mel Gorman <[email protected]> wrote:
>
> > On Mon, Apr 10, 2017 at 12:49:40PM -0500, Zi Yan wrote:
> > > On 10 Apr 2017, at 12:20, Mel Gorman wrote:
> > >
> > > > On Mon, Apr 10, 2017 at 11:45:08AM -0500, Zi Yan wrote:
> > > >>> While this could be fixed with heavy locking, it's only necessary to
> > > >>> make a copy of the PMD on the stack during change_pmd_range and avoid
> > > >>> races. A new helper is created for this as the check is quite subtle and the
> > > >>> existing similar helper is not suitable. This passed 154 hours of testing
> > > >>> (usually triggers between 20 minutes and 24 hours) without detecting bad
> > > >>> PMDs or corruption. A basic test of an autonuma-intensive workload showed
> > > >>> no significant change in behaviour.
> > > >>>
> > > >>> Signed-off-by: Mel Gorman <[email protected]>
> > > >>> Cc: [email protected]
> > > >>
> > > >> Does this patch fix the same problem fixed by Kirill's patch here?
> > > >> https://lkml.org/lkml/2017/3/2/347
> > > >>
> > > >
> > > > I don't think so. The race I'm concerned with is due to locks not being
> > > > held and is in a different path.
> > >
> > > I do not agree. Kirill's patch is fixing the same race problem but in
> > > zap_pmd_range().
> > >
> > > The original autoNUMA code first clears PMD then sets it to protnone entry.
> > > pmd_trans_huge() does not return TRUE because it saw cleared PMD, but
> > > pmd_none_or_clear_bad() later saw the protnone entry and reported it as bad.
> > > Is this the problem you are trying to solve?
> > >
> > > Kirill's patch will pmdp_invalidate() the PMD entry, which keeps _PAGE_PSE bit,
> > > so pmd_trans_huge() will return TRUE. In this case, it also fixes
> > > your race problem in change_pmd_range().
> > >
> > > Let me know if I miss anything.
> > >
> >
> > Ok, now I see. I think you're correct and I withdraw the patch.
>
> I have Kirill's
>
> thp-reduce-indentation-level-in-change_huge_pmd.patch
> thp-fix-madv_dontneed-vs-numa-balancing-race.patch
> mm-drop-unused-pmdp_huge_get_and_clear_notify.patch
> thp-fix-madv_dontneed-vs-madv_free-race.patch
> thp-fix-madv_dontneed-vs-madv_free-race-fix.patch
> thp-fix-madv_dontneed-vs-clear-soft-dirty-race.patch
>
> scheduled for 4.12-rc1. It sounds like
> thp-fix-madv_dontneed-vs-madv_free-race.patch and
> thp-fix-madv_dontneed-vs-madv_free-race.patch need to be boosted to
> 4.11 and stable?

Arguably all of them deal with different classes of race. The first two
should be tagged for any stable kernel after 3.15 because that's the
only one I know for certain occurs in the field albeit not on a mainline
kernel. There will be conflicts on older kernels due to changes in the
PMD locking API and it'll be up to the tree maintainers and patch owners
if they want to backport or not.

--
Mel Gorman
SUSE Labs
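
[Editorial note: Zi Yan's observation — Kirill's pmdp_invalidate() keeps
_PAGE_PSE set across the update, so concurrent pmd_trans_huge() checks never
see a non-huge value — can be contrasted with the old clear-then-set sequence
in the same toy model. Again this is hypothetical user-space illustration
code with simplified bit names, not the kernel implementation.]

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PSE      (1ULL << 7)   /* huge-page bit */
#define PROTNONE (1ULL << 8)   /* stand-in for the NUMA-hinting encoding */

static _Atomic uint64_t pmd = PSE;  /* starts out as a valid huge entry */

static bool pmd_trans_huge_val(uint64_t v) { return (v & PSE) != 0; }

/* Old sequence: clear, then install prot_numa. Between the two stores
 * there is a window where pmd_trans_huge() observes false. */
static void update_with_clear(void)
{
    atomic_store(&pmd, 0);              /* window: !pmd_trans_huge() */
    atomic_store(&pmd, PSE | PROTNONE);
}

/* pmdp_invalidate()-style sequence: every intermediate value keeps PSE,
 * so a concurrent pmd_trans_huge() check never returns false and the
 * later pmd_none_or_clear_bad() path is never reached. */
static void update_with_invalidate(void)
{
    atomic_store(&pmd, PSE);            /* invalidated but still "huge" */
    atomic_store(&pmd, PSE | PROTNONE);
}
```

This is why the invalidate-based fix also closes the change_pmd_range()
window without any extra locking on the reader side.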

2017-04-11 21:44:07

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa

On Tue, 11 Apr 2017 08:35:02 +0200 Vlastimil Babka <[email protected]> wrote:

> >> I have Kirill's
> >>
> >> thp-reduce-indentation-level-in-change_huge_pmd.patch
> >> thp-fix-madv_dontneed-vs-numa-balancing-race.patch
> >> mm-drop-unused-pmdp_huge_get_and_clear_notify.patch
> >> thp-fix-madv_dontneed-vs-madv_free-race.patch
> >> thp-fix-madv_dontneed-vs-madv_free-race-fix.patch
> >> thp-fix-madv_dontneed-vs-clear-soft-dirty-race.patch
> >>
> >> scheduled for 4.12-rc1. It sounds like
> >> thp-fix-madv_dontneed-vs-madv_free-race.patch and
> >> thp-fix-madv_dontneed-vs-madv_free-race.patch need to be boosted to
> >> 4.11 and stable?
> >
> > thp-fix-madv_dontneed-vs-numa-balancing-race.patch is the fix for
> > numa balancing problem reported in this thread.
> >
> > mm-drop-unused-pmdp_huge_get_and_clear_notify.patch,
> > thp-fix-madv_dontneed-vs-madv_free-race.patch,
> > thp-fix-madv_dontneed-vs-madv_free-race-fix.patch, and
> > thp-fix-madv_dontneed-vs-clear-soft-dirty-race.patch
> >
> > are the fixes for other potential race problems similar to this one.
> >
> > I think it is better to have all these patches applied.
>
> Yeah we should get all such fixes to stable IMHO (after review :). It's
> not the first time that a fix for MADV_DONTNEED turned out to also fix a
> race that involved "normal operation" with THP, without such syscalls.

The presence of thp-reduce-indentation-level-in-change_huge_pmd.patch
is a pain in the ass but I've decided to keep it rather than churning
all the patches at a late stage.