2011-03-14 08:09:04

by Hugh Dickins

Subject: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!

THP's collapse_huge_page() has an understandable but ugly difference
in when its huge page is allocated: inside if NUMA but outside if not.
It's hardly surprising that the memcg failure path forgot that, freeing
the page in the non-NUMA case, then hitting a VM_BUG_ON in get_page()
(or even worse, using the freed page).
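
For reference, the check being tripped (include/linux/mm.h:370 in
2.6.38) is the refcount sanity test at the top of get_page(), which in
simplified form (the real one also special-cases compound tail pages)
is:

        static inline void get_page(struct page *page)
        {
                /* a page already freed back to the allocator has _count == 0 */
                VM_BUG_ON(atomic_read(&page->_count) <= 0);
                atomic_inc(&page->_count);
        }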

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/huge_memory.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

--- 2.6.38-rc8/mm/huge_memory.c 2011-03-08 09:27:16.000000000 -0800
+++ linux/mm/huge_memory.c 2011-03-13 18:26:21.000000000 -0700
@@ -1762,6 +1762,10 @@ static void collapse_huge_page(struct mm
#ifndef CONFIG_NUMA
VM_BUG_ON(!*hpage);
new_page = *hpage;
+ if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
+ up_read(&mm->mmap_sem);
+ return;
+ }
#else
VM_BUG_ON(*hpage);
/*
@@ -1781,12 +1785,12 @@ static void collapse_huge_page(struct mm
*hpage = ERR_PTR(-ENOMEM);
return;
}
-#endif
if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
up_read(&mm->mmap_sem);
put_page(new_page);
return;
}
+#endif

/* after allocating the hugepage upgrade to mmap_sem write mode */
up_read(&mm->mmap_sem);


2011-03-14 15:25:35

by Minchan Kim

Subject: Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!

On Mon, Mar 14, 2011 at 5:08 PM, Hugh Dickins <[email protected]> wrote:
> THP's collapse_huge_page() has an understandable but ugly difference
> in when its huge page is allocated: inside if NUMA but outside if not.
> It's hardly surprising that the memcg failure path forgot that, freeing
> the page in the non-NUMA case, then hitting a VM_BUG_ON in get_page()
> (or even worse, using the freed page).
>
> Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

Thanks, Hugh.
--
Kind regards,
Minchan Kim

2011-03-14 15:52:42

by Andrea Arcangeli

Subject: Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!

On Mon, Mar 14, 2011 at 01:08:47AM -0700, Hugh Dickins wrote:
> mm/huge_memory.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> --- 2.6.38-rc8/mm/huge_memory.c 2011-03-08 09:27:16.000000000 -0800
> +++ linux/mm/huge_memory.c 2011-03-13 18:26:21.000000000 -0700
> @@ -1762,6 +1762,10 @@ static void collapse_huge_page(struct mm
> #ifndef CONFIG_NUMA
> VM_BUG_ON(!*hpage);
> new_page = *hpage;
> + if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
> + up_read(&mm->mmap_sem);
> + return;
> + }
> #else
> VM_BUG_ON(*hpage);
> /*
> @@ -1781,12 +1785,12 @@ static void collapse_huge_page(struct mm
> *hpage = ERR_PTR(-ENOMEM);
> return;
> }
> -#endif
> if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
> up_read(&mm->mmap_sem);
> put_page(new_page);
> return;
> }
> +#endif
>
> /* after allocating the hugepage upgrade to mmap_sem write mode */
> up_read(&mm->mmap_sem);

Correct! I'd suggest fixing it without duplicating the
mem_cgroup_newpage_charge. It's just one more put_page than needed
with memcg enabled and NUMA disabled (I guess most memcg testing
happened with NUMA enabled). The larger diff would likely reject
against the -mm NUMA code... I'll try to diff it with a smaller -U2
context (this code has little chance of being misplaced), which may
allow it to apply cleanly regardless of the merging order and so make
life easier.

It may have been overkill to keep the NUMA case separated in order to
avoid spurious allocations for the not-NUMA case: the code became
more complex and I'm not sure it's really worthwhile. The
optimization makes sense but it's minor, and it created more
complexity than strictly needed. We can't change it in the short term
as it has been tested this way, but if people dislike the additional
complexity that this optimization created, I'm not against dropping
it in the future. Your comment was positive about the optimization
(you said "understandable") so I wanted to share my current thinking
on these ifdefs...

Thanks,
Andrea

===
Subject: thp+memcg-numa: fix BUG at include/linux/mm.h:370!

From: Hugh Dickins <[email protected]>

THP's collapse_huge_page() has an understandable but ugly difference
in when its huge page is allocated: inside if NUMA but outside if not.
It's hardly surprising that the memcg failure path forgot that, freeing
the page in the non-NUMA case, then hitting a VM_BUG_ON in get_page()
(or even worse, using the freed page).

Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
---

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index dbe99a5..bf41114 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1785,5 +1785,7 @@ static void collapse_huge_page(struct mm_struct *mm,
if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
up_read(&mm->mmap_sem);
+#ifdef CONFIG_NUMA
put_page(new_page);
+#endif
return;
}

2011-03-14 16:38:02

by Hugh Dickins

Subject: Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!

On Mon, 14 Mar 2011, Andrea Arcangeli wrote:
>
> Correct! I'd suggest fixing it without duplicating the
> mem_cgroup_newpage_charge. It's just one more put_page than needed
> with memcg enabled and NUMA disabled (I guess most memcg testing
> happened with NUMA enabled). The larger diff would likely reject
> against the -mm NUMA code... I'll try to diff it with a smaller -U2
> context (this code has little chance of being misplaced), which may
> allow it to apply cleanly regardless of the merging order and so make
> life easier.

I did try it that way at first (didn't help when I mistakenly put
#ifndef instead of #ifdef around the put_page!), but was repulsed
by seeing yet another #ifdef CONFIG_NUMA, so went with the duplicating
version - which Linus has now taken.

>
> It may have been overkill to keep the NUMA case separated in order to
> avoid spurious allocations for the not-NUMA case: the code became
> more complex and I'm not sure it's really worthwhile. The
> optimization makes sense but it's minor, and it created more
> complexity than strictly needed. We can't change it in the short term
> as it has been tested this way, but if people dislike the additional
> complexity that this optimization created, I'm not against dropping
> it in the future. Your comment was positive about the optimization
> (you said "understandable") so I wanted to share my current thinking
> on these ifdefs...

I was certainly tempted to remove all the non-NUMA cases, but as you
say, now is not the right time for that since we needed a quick bugfix.

I do appreciate why you did it that way, it is nicer to be allocating
on the outside, though unsuitable in the NUMA case. But given how this
bug has passed unnoticed for so long, it seems like nobody has been
testing non-NUMA, so yes, best to simplify and make non-NUMA do the
same as NUMA in 2.6.39.

Since Linus has taken my version that you didn't like, perhaps you can
get even by sending him your "mm: PageBuddy cleanups" patch, the version
I didn't like (for its silly branches) so was reluctant to push myself.

I'd really like to see that fix in, since it's a little hard to argue
for in -stable, being all about a system which is already unstable.
But I think it needs a stronger title than "PageBuddy cleanups" -
"fix BUG in bad_page()"?

Hugh

2011-03-14 16:57:00

by Linus Torvalds

Subject: Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!

On Mon, Mar 14, 2011 at 9:37 AM, Hugh Dickins <[email protected]> wrote:
>
> I did try it that way at first (didn't help when I mistakenly put
> #ifndef instead of #ifdef around the put_page!), but was repulsed
> by seeing yet another #ifdef CONFIG_NUMA, so went with the duplicating
> version - which Linus has now taken.

I have to admit to being repulsed by the whole patch, but my main
source of "that's effin ugly" was from the crazy lock handling.

Does mem_cgroup_newpage_charge() even _need_ the mmap_sem at all? And
if not, why not release the read-lock early? And even if it _does_
need it, why not do

ret = mem_cgroup_newpage_charge();
up_read(&mm->mmap_sem);
if (ret) {
...

finally, the #ifdef CONFIG_NUMA is ugly, but it's ugly in the return
path of the function too, and the nicer way would probably be to have
it in one place and do something like

/*
 * The allocation rules are different for the NUMA/non-NUMA cases.
 * For the NUMA case, we allocate here, for the non-numa case we
 * use the allocation in *hpage
 */
static inline struct page *collapse_alloc_hugepage(struct page **hpage)
{
#ifdef CONFIG_NUMA
        VM_BUG_ON(*hpage);
        return alloc_hugepage_vma(khugepaged_defrag(), vma, address, node);
#else
        VM_BUG_ON(!*hpage);
        return *hpage;
#endif
}

static inline void collapse_free_hugepage(struct page *page)
{
#ifdef CONFIG_NUMA
        put_page(page);
#else
        /* Nothing to do */
#endif
}

and use that instead. The point being that the #ifdef'fery now ends up
being in a much more targeted area and much better abstracted, rather
than in the middle of code, and ugly as sin.

But as mentioned, the lock handling is disgusting. Why is it even safe
to drop and re-take the lock at all?

Linus

2011-03-14 17:00:00

by Andrea Arcangeli

Subject: [PATCH] mm: PageBuddy and mapcount underflows robustness

Hi Hugh,

On Mon, Mar 14, 2011 at 09:37:43AM -0700, Hugh Dickins wrote:
> version - which Linus has now taken.

That was fast, so that solves all merging-order rejects in the first
place, as we'll all rebase on upstream.

> I was certainly tempted to remove all the non-NUMA cases, but as you
> say, now is not the right time for that since we needed a quick bugfix.

Correct. Too much risk in making the not-NUMA case behave like the NUMA case.

> I do appreciate why you did it that way, it is nicer to be allocating

BTW, one reason it was done this way is also because the not-NUMA case
was the original code, and when I added the NUMA awareness to
khugepaged I didn't want to risk regressions to the existing case that
I knew worked fine.

> on the outside, though unsuitable in the NUMA case. But given how this
> bug has passed unnoticed for so long, it seems like nobody has been
> testing non-NUMA, so yes, best to simplify and make non-NUMA do the
> same as NUMA in 2.6.39.

These days the NUMA case clearly gets more testing than the not-NUMA
case, especially for features like memcg that are mostly needed on
NUMA systems and not so much on small laptops or other non-NUMA
machines.

I'm unsure it's so urgent to remove it; a little benchmarking with
khugepaged at 100% load may be worth trying first, but if there's no
real change in how fast the khugepaged/pages_collapsed counter
increases, I'm surely ok with removing it later.

> Since Linus has taken my version that you didn't like, perhaps you can

I'm ok with your version... no problem.

> get even by sending him your "mm: PageBuddy cleanups" patch, the version
> I didn't like (for its silly branches) so was reluctant to push myself.

Ok, that one slipped even my own aa.git tree, as it was more of an RFC
when I posted it; it didn't fix a real bug but just made the code more
robust at runtime in case of hardware memory corruption or software
bugs.

> I'd really like to see that fix in, since it's a little hard to argue
> for in -stable, being all about a system which is already unstable.
>
> But I think it needs a stronger title than "PageBuddy cleanups" -
> "fix BUG in bad_page()"?

I think this can comfortably wait for 2.6.39-rc. I think it's best if
Andrew merges it so it gets digested through -mm for a while. But let
me know if you prefer something else. So here it is with a slightly
updated header.

===
Subject: mm: PageBuddy and mapcount underflows robustness

From: Andrea Arcangeli <[email protected]>

bad_page() could hit the VM_BUG_ON(!PageBuddy(page)) inside __ClearPageBuddy().
I prefer to keep the VM_BUG_ON for safety and to add an if to solve it.

Change the _mapcount value indicating PageBuddy from -2 to -1024*1024 for more
robustness against page_mapcount() underflows.

Signed-off-by: Andrea Arcangeli <[email protected]>
Reported-by: Hugh Dickins <[email protected]>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f6385fc..fa16ba0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -402,16 +402,22 @@ static inline void init_page_count(struct page *page)
/*
* PageBuddy() indicate that the page is free and in the buddy system
* (see mm/page_alloc.c).
+ *
+ * PAGE_BUDDY_MAPCOUNT_VALUE must be <= -2 but better not too close to
+ * -2 so that an underflow of the page_mapcount() won't be mistaken
+ * for a genuine PAGE_BUDDY_MAPCOUNT_VALUE.
*/
+#define PAGE_BUDDY_MAPCOUNT_VALUE (-1024*1024)
+
static inline int PageBuddy(struct page *page)
{
- return atomic_read(&page->_mapcount) == -2;
+ return atomic_read(&page->_mapcount) == PAGE_BUDDY_MAPCOUNT_VALUE;
}

static inline void __SetPageBuddy(struct page *page)
{
VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
- atomic_set(&page->_mapcount, -2);
+ atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE);
}

static inline void __ClearPageBuddy(struct page *page)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a873e61..8aac134 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -286,7 +286,9 @@ static void bad_page(struct page *page)

/* Don't complain about poisoned pages */
if (PageHWPoison(page)) {
- __ClearPageBuddy(page);
+ /* __ClearPageBuddy VM_BUG_ON(!PageBuddy(page)) */
+ if (PageBuddy(page))
+ __ClearPageBuddy(page);
return;
}

@@ -317,7 +319,8 @@ static void bad_page(struct page *page)
dump_stack();
out:
/* Leave bad fields for debug, except PageBuddy could make trouble */
- __ClearPageBuddy(page);
+ if (PageBuddy(page)) /* __ClearPageBuddy VM_BUG_ON(!PageBuddy(page)) */
+ __ClearPageBuddy(page);
add_taint(TAINT_BAD_PAGE);
}

2011-03-14 17:18:09

by Andrea Arcangeli

Subject: Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!

Hello,

On Mon, Mar 14, 2011 at 09:56:10AM -0700, Linus Torvalds wrote:
> On Mon, Mar 14, 2011 at 9:37 AM, Hugh Dickins <[email protected]> wrote:
> >
> > I did try it that way at first (didn't help when I mistakenly put
> > #ifndef instead of #ifdef around the put_page!), but was repulsed
> > by seeing yet another #ifdef CONFIG_NUMA, so went with the duplicating
> > version - which Linus has now taken.
>
> I have to admit to being repulsed by the whole patch, but my main
> source of "that's effin ugly" was from the crazy lock handling.
>
> Does mem_cgroup_newpage_charge() even _need_ the mmap_sem at all? And
> if not, why not release the read-lock early? And even if it _does_
> need it, why not do
>

The mmap_sem is needed for the page allocation, or the "vma" can
become a dangling pointer (the vma is passed to alloc_hugepage_vma).

> ret = mem_cgroup_newpage_charge();
> up_read(&mm->mmap_sem);
> if (ret) {
> ...
>
> finally, the #ifdef CONFIG_NUMA is ugly, but it's ugly in the return
> path of the function too, and the nicer way would probably be to have
> it in one place and do something like
>
> /*
>  * The allocation rules are different for the NUMA/non-NUMA cases.
>  * For the NUMA case, we allocate here, for the non-numa case we
>  * use the allocation in *hpage
>  */
> static inline struct page *collapse_alloc_hugepage(struct page **hpage)
> {
> #ifdef CONFIG_NUMA
>         VM_BUG_ON(*hpage);
>         return alloc_hugepage_vma(khugepaged_defrag(), vma, address, node);
> #else
>         VM_BUG_ON(!*hpage);
>         return *hpage;
> #endif
> }
>
> static inline void collapse_free_hugepage(struct page *page)
> {
> #ifdef CONFIG_NUMA
>         put_page(page);
> #else
>         /* Nothing to do */
> #endif
> }
>
> and use that instead. The point being that the #ifdef'fery now ends up
> being in a much more targeted area and much better abstracted, rather
> than in the middle of code, and ugly as sin.

Agreed about the cleanup. I didn't add a new function for it like I
probably should have, to make it less ugly...

About mem_cgroup_newpage_charge I think you're right it won't need the
mmap_sem. Running it under it is sure safe. But if it's not needed we
can move the up_read before the mem_cgroup_newpage_charge like you
suggested. Johannes/Minchan could you confirm the mmap_sem isn't
needed around mem_cgroup_newpage_charge? The mm and new_page are
stable without the mmap_sem, only the vma goes away but the memcg
shouldn't care.
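
If so, the reshuffle would be roughly this (just a sketch of the
ordering, not the actual patch; NUMA path shown, identifiers as in the
current code):

        /* vma is only needed (and only valid) while mmap_sem is held */
        new_page = alloc_hugepage_vma(khugepaged_defrag(), vma, address, node);
        /* done with the vma: safe to drop the read lock before the charge */
        up_read(&mm->mmap_sem);
        if (unlikely(!new_page)) {
                *hpage = ERR_PTR(-ENOMEM);
                return;
        }
        /* only mm and new_page are used from here on, no vma */
        if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
                put_page(new_page);
                return;
        }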

> But as mentioned, the lock handling is disgusting. Why is it even safe
> to drop and re-take the lock at all?

We have to upgrade the rwsem from read to write (a huge pmd isn't
allowed to materialize under code that runs with the mmap_sem held in
read mode; that's one of the invariants that make it safe for
split_huge_page_pmd to do nothing and let the regular pte walk go
ahead). It is safe because we revalidate the vma after dropping the
read lock and taking the write lock.

It's generally impossible to upgrade the lock without first dropping
it if more than one thread does that in parallel on the same mm (they
all hold the read lock, so somebody has to drop the read lock and
revalidate before anyone has a chance to take the write lock). Now,
interestingly, I notice that in this very case khugepaged is single
threaded and no other place would call upgrade_read() on the mmap_sem,
so it probably would be safe, but there's no method for that (because
it'd need to be called by at most one thread at once on the same mm,
and that's probably not considered a useful case, even if it probably
would be in collapse_huge_page). If we ever thread khugepaged to run
on more than one core then we'd be forced to drop the lock too (unless
we make the scan of the same mm mutually exclusive, which isn't the
best for a large mm anyway). But I doubt we'll ever need to thread
khugepaged, it's blazing fast (unlike the ksmd scan). So if we
implement an upgrade_read() we might be able to remove a find_vma in
collapse_huge_page. It's not very important to optimize though, as the
memory copy should be the biggest cost anyway.
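
For reference, the pattern in collapse_huge_page() is roughly this
(trimmed; the real code also rechecks the vma flags and a few other
conditions before touching the pagetables):

        up_read(&mm->mmap_sem);
        /* ... allocate and charge the new hugepage ... */
        down_write(&mm->mmap_sem);
        if (unlikely(khugepaged_test_exit(mm)))
                goto out;
        /* the old vma pointer may be stale: look it up again and revalidate */
        vma = find_vma(mm, address);
        hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
        hend = vma->vm_end & HPAGE_PMD_MASK;
        if (address < hstart || address + HPAGE_PMD_SIZE > hend)
                goto out;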

2011-03-14 17:37:19

by Linus Torvalds

Subject: Re: [PATCH] mm: PageBuddy and mapcount underflows robustness

On Mon, Mar 14, 2011 at 9:59 AM, Andrea Arcangeli <[email protected]> wrote:
>
> +#define PAGE_BUDDY_MAPCOUNT_VALUE (-1024*1024)

I realize that this is a nitpick, but from a code generation
standpoint, large random constants like these are just nasty.

I would suggest aiming for constants that are easy to generate and/or
fit better in the code stream. In many encoding schemes (eg x86), -128
is much easier to generate, since it fits in a signed byte and allows
small instructions etc. And in most RISC encodings, 8- or 16-bit
constants can be encoded much more easily than something like your
current one, and bigger ones often end up resulting in a load from
memory or at least several immediate-building instructions.
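
To make it concrete (x86-64 as the example; treat the exact encodings
as illustrative, they depend on the addressing mode the compiler
picks):

        /* PageBuddy() test with a small constant: the compare can use a
           sign-extended 8-bit immediate, e.g. "cmpl $-128, (%rdi)" */
        return atomic_read(&page->_mapcount) == -128;

        /* with -1024*1024 (0xfff00000) the same compare needs a full
           32-bit immediate, e.g. "cmpl $0xfff00000, (%rdi)" */
        return atomic_read(&page->_mapcount) == (-1024*1024);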

> - __ClearPageBuddy(page);
> + if (PageBuddy(page)) /* __ClearPageBuddy VM_BUG_ON(!PageBuddy(page)) */
> + __ClearPageBuddy(page);

Also, this is just disgusting. It adds no safety here to have that
VM_BUG_ON(), so it's just unnecessary code generation to do this.
Also, we don't even WANT to do that stupid "__ClearPageBuddy()" in the
first place! What those two code-sites actually want are just a simple

reset_page_mapcount(page);

which does the right thing in _general_, and not just for the buddy
case - we want to reset the mapcount for other reasons than just
pagebuddy (ie the underflow/overflow case).
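
(reset_page_mapcount() is nothing more than the plain reset back to
-1, roughly:

        static inline void reset_page_mapcount(struct page *page)
        {
                atomic_set(&(page)->_mapcount, -1);
        }

so it's trivially safe to call no matter what state the mapcount was
left in.)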

And it avoids the VM_BUG_ON() too, making the crazy conditionals be not needed.

No?

Linus

2011-03-14 19:58:57

by Johannes Weiner

Subject: Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!

On Mon, Mar 14, 2011 at 06:17:31PM +0100, Andrea Arcangeli wrote:
> On Mon, Mar 14, 2011 at 09:56:10AM -0700, Linus Torvalds wrote:
> > Does mem_cgroup_newpage_charge() even _need_ the mmap_sem at all? And
> > if not, why not release the read-lock early? And even if it _does_
> > need it, why not do

[...]

> About mem_cgroup_newpage_charge I think you're right it won't need the
> mmap_sem. Running it under it is sure safe. But if it's not needed we
> can move the up_read before the mem_cgroup_newpage_charge like you
> suggested. Johannes/Minchan could you confirm the mmap_sem isn't
> needed around mem_cgroup_newpage_charge? The mm and new_page are
> stable without the mmap_sem, only the vma goes away but the memcg
> shouldn't care.

We don't care about the vma. It's all about assigning the physical
page to the memcg that mm->owner belongs to.

It would be the first callsite not holding the mmap_sem, but that is
only because all existing sites are fault handlers that don't drop the
lock for other reasons.

I am not aware of anything that would rely on the lock in there, or
would not deserve to break if it did.

2011-03-14 23:49:28

by Kamezawa Hiroyuki

Subject: Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!

On Mon, 14 Mar 2011 20:58:23 +0100
Johannes Weiner <[email protected]> wrote:

> On Mon, Mar 14, 2011 at 06:17:31PM +0100, Andrea Arcangeli wrote:
> > On Mon, Mar 14, 2011 at 09:56:10AM -0700, Linus Torvalds wrote:
> > > Does mem_cgroup_newpage_charge() even _need_ the mmap_sem at all? And
> > > if not, why not release the read-lock early? And even if it _does_
> > > need it, why not do
>
> [...]
>
> > About mem_cgroup_newpage_charge I think you're right it won't need the
> > mmap_sem. Running it under it is sure safe. But if it's not needed we
> > can move the up_read before the mem_cgroup_newpage_charge like you
> > suggested. Johannes/Minchan could you confirm the mmap_sem isn't
> > needed around mem_cgroup_newpage_charge? The mm and new_page are
> > stable without the mmap_sem, only the vma goes away but the memcg
> > shouldn't care.
>
> We don't care about the vma. It's all about assigning the physical
> page to the memcg that mm->owner belongs to.
>
> It would be the first callsite not holding the mmap_sem, but that is
> only because all existing sites are fault handlers that don't drop the
> lock for other reasons.
>
> I am not aware of anything that would rely on the lock in there, or
> would not deserve to break if it did.
>

mmap_sem is not required to be held, as long as the uncharge()
operation is done when the vma turns out to be a stale pointer.

Thanks,
-Kame

2011-03-17 23:16:45

by Andrea Arcangeli

Subject: Re: [PATCH] mm: PageBuddy and mapcount underflows robustness

On Mon, Mar 14, 2011 at 10:30:11AM -0700, Linus Torvalds wrote:
> On Mon, Mar 14, 2011 at 9:59 AM, Andrea Arcangeli <[email protected]> wrote:
> >
> > +#define PAGE_BUDDY_MAPCOUNT_VALUE (-1024*1024)
>
> I realize that this is a nitpick, but from a code generation
> standpoint, large random constants like these are just nasty.
>
> I would suggest aiming for constants that are easy to generate and/or
> fit better in the code stream. In many encoding schemes (eg x86), -128
> is much easier to generate, since it fits in a signed byte and allows
> small instructions etc. And in most RISC encodings, 8- or 16-bit
> constants can be encoded much more easily than something like your
> current one, and bigger ones often end up resulting in a load from
> memory or at least several immediate-building instructions.

-128 is surely ok with me. It's extremely unlikely to be off by
exactly 127, and if we're off by more than 127 we're still going to
detect the error.

> Also, this is just disgusting. It adds no safety here to have that
> VM_BUG_ON(), so it's just unnecessary code generation to do this.
> Also, we don't even WANT to do that stupid "__ClearPageBuddy()" in the
> first place! What those two code-sites actually want are just a simple
>
> reset_page_mapcount(page);
>
> which does the right thing in _general_, and not just for the buddy
> case - we want to reset the mapcount for other reasons than just
> pagebuddy (ie the underflow/overflow case).

Agreed.

> And it avoids the VM_BUG_ON() too, making the crazy conditionals be not needed.

Well, using reset_page_mapcount in the two error sites didn't require
me to remove the VM_BUG_ON from __ClearPageBuddy, so I left it
there...

===
Subject: mm: PageBuddy and mapcount robustness

From: Andrea Arcangeli <[email protected]>

Change the _mapcount value indicating PageBuddy from -2 to -128 for
more robustness against page_mapcount() underflows.

Use reset_page_mapcount instead of __ClearPageBuddy in bad_page to
ignore the previous retval of PageBuddy().

Signed-off-by: Andrea Arcangeli <[email protected]>
Reported-by: Hugh Dickins <[email protected]>
---
include/linux/mm.h | 11 +++++++++--
mm/page_alloc.c | 4 ++--
2 files changed, 11 insertions(+), 4 deletions(-)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -402,16 +402,23 @@ static inline void init_page_count(struc
/*
* PageBuddy() indicate that the page is free and in the buddy system
* (see mm/page_alloc.c).
+ *
+ * PAGE_BUDDY_MAPCOUNT_VALUE must be <= -2 but better not too close to
+ * -2 so that an underflow of the page_mapcount() won't be mistaken
+ * for a genuine PAGE_BUDDY_MAPCOUNT_VALUE. -128 can be created very
+ * efficiently by most CPU architectures.
*/
+#define PAGE_BUDDY_MAPCOUNT_VALUE (-128)
+
static inline int PageBuddy(struct page *page)
{
- return atomic_read(&page->_mapcount) == -2;
+ return atomic_read(&page->_mapcount) == PAGE_BUDDY_MAPCOUNT_VALUE;
}

static inline void __SetPageBuddy(struct page *page)
{
VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
- atomic_set(&page->_mapcount, -2);
+ atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE);
}

static inline void __ClearPageBuddy(struct page *page)
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -286,7 +286,7 @@ static void bad_page(struct page *page)

/* Don't complain about poisoned pages */
if (PageHWPoison(page)) {
- __ClearPageBuddy(page);
+ reset_page_mapcount(page); /* remove PageBuddy */
return;
}

@@ -317,7 +317,7 @@ static void bad_page(struct page *page)
dump_stack();
out:
/* Leave bad fields for debug, except PageBuddy could make trouble */
- __ClearPageBuddy(page);
+ reset_page_mapcount(page); /* remove PageBuddy */
add_taint(TAINT_BAD_PAGE);
}

2011-03-18 21:34:29

by Hugh Dickins

Subject: Re: [PATCH] mm: PageBuddy and mapcount underflows robustness

On Fri, 18 Mar 2011, Andrea Arcangeli wrote:
> On Mon, Mar 14, 2011 at 10:30:11AM -0700, Linus Torvalds wrote:
> Subject: mm: PageBuddy and mapcount robustness
>
> From: Andrea Arcangeli <[email protected]>
>
> Change the _mapcount value indicating PageBuddy from -2 to -128 for
> more robustness against page_mapcount() underflows.
>
> Use reset_page_mapcount instead of __ClearPageBuddy in bad_page to
> ignore the previous retval of PageBuddy().
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> Reported-by: Hugh Dickins <[email protected]>

Yes, this version satisfies my objections too.
I'd say Acked-by but I see that it's already in, great.

I've Cc'ed [email protected]: please can we have this in 2.6.38.1,
since 2.6.38 regressed the recovery from bad page states,
inadvertently changing them to a fatal error when CONFIG_DEBUG_VM.

Thanks,
Hugh

commit ef2b4b95a63a1d23958dcb99eb2c6898eddc87d0
Author: Andrea Arcangeli <[email protected]>
Date: Fri Mar 18 00:16:35 2011 +0100

mm: PageBuddy and mapcount robustness

Change the _mapcount value indicating PageBuddy from -2 to -128 for
more robustness against page_mapcount() underflows.

Use reset_page_mapcount instead of __ClearPageBuddy in bad_page to
ignore the previous retval of PageBuddy().

Signed-off-by: Andrea Arcangeli <[email protected]>
Reported-by: Hugh Dickins <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 679300c..ff83798 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -402,16 +402,23 @@ static inline void init_page_count(struct page *page)
/*
* PageBuddy() indicate that the page is free and in the buddy system
* (see mm/page_alloc.c).
+ *
+ * PAGE_BUDDY_MAPCOUNT_VALUE must be <= -2 but better not too close to
+ * -2 so that an underflow of the page_mapcount() won't be mistaken
+ * for a genuine PAGE_BUDDY_MAPCOUNT_VALUE. -128 can be created very
+ * efficiently by most CPU architectures.
*/
+#define PAGE_BUDDY_MAPCOUNT_VALUE (-128)
+
static inline int PageBuddy(struct page *page)
{
- return atomic_read(&page->_mapcount) == -2;
+ return atomic_read(&page->_mapcount) == PAGE_BUDDY_MAPCOUNT_VALUE;
}

static inline void __SetPageBuddy(struct page *page)
{
VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
- atomic_set(&page->_mapcount, -2);
+ atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE);
}

static inline void __ClearPageBuddy(struct page *page)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bd76256..7945247 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -286,7 +286,7 @@ static void bad_page(struct page *page)

/* Don't complain about poisoned pages */
if (PageHWPoison(page)) {
- __ClearPageBuddy(page);
+ reset_page_mapcount(page); /* remove PageBuddy */
return;
}

@@ -317,7 +317,7 @@ static void bad_page(struct page *page)
dump_stack();
out:
/* Leave bad fields for debug, except PageBuddy could make trouble */
- __ClearPageBuddy(page);
+ reset_page_mapcount(page); /* remove PageBuddy */
add_taint(TAINT_BAD_PAGE);
}

2011-03-23 23:16:27

by Greg KH

Subject: Re: [stable] [PATCH] mm: PageBuddy and mapcount underflows robustness

On Fri, Mar 18, 2011 at 02:34:03PM -0700, Hugh Dickins wrote:
> On Fri, 18 Mar 2011, Andrea Arcangeli wrote:
> > On Mon, Mar 14, 2011 at 10:30:11AM -0700, Linus Torvalds wrote:
> > Subject: mm: PageBuddy and mapcount robustness
> >
> > From: Andrea Arcangeli <[email protected]>
> >
> > Change the _mapcount value indicating PageBuddy from -2 to -128 for
> > more robustness against page_mapcount() underflows.
> >
> > Use reset_page_mapcount instead of __ClearPageBuddy in bad_page to
> > ignore the previous retval of PageBuddy().
> >
> > Signed-off-by: Andrea Arcangeli <[email protected]>
> > Reported-by: Hugh Dickins <[email protected]>
>
> Yes, this version satisfies my objections too.
> I'd say Acked-by but I see that it's already in, great.
>
> I've Cc'ed [email protected]: please can we have this in 2.6.38.1,
> since 2.6.38 regressed the recovery from bad page states,
> inadvertently changing them to a fatal error when CONFIG_DEBUG_VM.

Now queued up for 2.6.38.2, thanks.

greg k-h

2011-04-01 21:22:16

by Andrea Arcangeli

Subject: Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!

On Mon, Mar 14, 2011 at 08:58:23PM +0100, Johannes Weiner wrote:
> We don't care about the vma. It's all about assigning the physical
> page to the memcg that mm->owner belongs to.
>
> It would be the first callsite not holding the mmap_sem, but that is
> only because all existing sites are fault handlers that don't drop the
> lock for other reasons.

I was afraid it'd be the first callsite; that's why I wasn't excited
to push it in 2.6.38, but Linus is right and we should micro-optimize
it for 2.6.39.

> I am not aware of anything that would rely on the lock in there, or
> would not deserve to break if it did.

Thanks for double checking.

What about this? (The only problem is that the thp-vmstat patch in -mm
will then reject; maybe I should rediff it against -mm instead, as you
wish.)

===
Subject: thp: optimize memcg charge in khugepaged

From: Andrea Arcangeli <[email protected]>

We don't need to hold the mmap_sem through mem_cgroup_newpage_charge(); the
mmap_sem is only held to keep the vma stable, and we don't need the vma
stable anymore after we return from alloc_hugepage_vma().

Signed-off-by: Andrea Arcangeli <[email protected]>
---

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0a619e0..c61d9ad 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1762,12 +1762,9 @@ static void collapse_huge_page(struct mm_struct *mm,

VM_BUG_ON(address & ~HPAGE_PMD_MASK);
#ifndef CONFIG_NUMA
+ up_read(&mm->mmap_sem);
VM_BUG_ON(!*hpage);
new_page = *hpage;
- if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
- up_read(&mm->mmap_sem);
- return;
- }
#else
VM_BUG_ON(*hpage);
/*
@@ -1782,20 +1779,20 @@ static void collapse_huge_page(struct mm_struct *mm,
*/
new_page = alloc_hugepage_vma(khugepaged_defrag(), vma, address,
node, __GFP_OTHER_NODE);
+ /* after allocating the hugepage upgrade to mmap_sem write mode */
+ up_read(&mm->mmap_sem);
if (unlikely(!new_page)) {
- up_read(&mm->mmap_sem);
*hpage = ERR_PTR(-ENOMEM);
return;
}
+#endif
+
if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
- up_read(&mm->mmap_sem);
+#ifdef CONFIG_NUMA
put_page(new_page);
+#endif
return;
}
-#endif
-
- /* after allocating the hugepage upgrade to mmap_sem write mode */
- up_read(&mm->mmap_sem);

/*
* Prevent all access to pagetables with the exception of