When we upgraded our kernel, we started seeing some page corruption like
the following consistently:
BUG: Bad page state in process ganesha.nfsd pfn:1304ca
page:0000000022261c55 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0 pfn:0x1304ca
flags: 0x17ffffc0000000()
raw: 0017ffffc0000000 ffff8a513ffd4c98 ffffeee24b35ec08 0000000000000000
raw: 0000000000000000 0000000000000001 00000000ffffff7f 0000000000000000
page dumped because: nonzero mapcount
CPU: 0 PID: 15567 Comm: ganesha.nfsd Kdump: loaded Tainted: P B O 5.10.158-1.nutanix.20221209.el7.x86_64 #1
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
Call Trace:
dump_stack+0x74/0x96
bad_page.cold+0x63/0x94
check_new_page_bad+0x6d/0x80
rmqueue+0x46e/0x970
get_page_from_freelist+0xcb/0x3f0
? _cond_resched+0x19/0x40
__alloc_pages_nodemask+0x164/0x300
alloc_pages_current+0x87/0xf0
skb_page_frag_refill+0x84/0x110
...
Sometimes, it would also show up as corruption in the free list pointer and
cause crashes.
After bisecting the issue, we found the issue started from e320d3012d25:
if (put_page_testzero(page))
free_the_page(page, order);
else if (!PageHead(page))
while (order-- > 0)
free_the_page(page + (1 << order), order);
So the problem is the check PageHead is racy because at this point we
already dropped our reference to the page. So even if we came in with
compound page, the page can already be freed and PageHead can return
false and we will end up freeing all the tail pages causing double free.
Fixes: e320d3012d25 ("mm/page_alloc.c: fix freeing non-compound pages")
Cc: Andrew Morton <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Chunwei Chen <[email protected]>
---
mm/page_alloc.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0745aedebb37..3bb3484563ed 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5631,9 +5631,12 @@ EXPORT_SYMBOL(get_zeroed_page);
*/
void __free_pages(struct page *page, unsigned int order)
{
+ /* get PageHead before we drop reference */
+ int head = PageHead(page);
+
if (put_page_testzero(page))
free_the_page(page, order);
- else if (!PageHead(page))
+ else if (!head)
while (order-- > 0)
free_the_page(page + (1 << order), order);
}
--
2.22.3
On Thu, Feb 09, 2023 at 05:48:28PM +0000, David Chen wrote:
> So the problem is the check PageHead is racy because at this point we
> already dropped our reference to the page. So even if we came in with
> compound page, the page can already be freed and PageHead can return
> false and we will end up freeing all the tail pages causing double free.
>
> Fixes: e320d3012d25 ("mm/page_alloc.c: fix freeing non-compound pages")
> Cc: Andrew Morton <[email protected]>
> Cc: Matthew Wilcox (Oracle) <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Chunwei Chen <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
That's a pretty narrow window you managed to hit! Just a few
instructions wide. I guess that answers the question Andrew originally
had (Does this ever happen?) I just changed it from a silent memory
leak into a double-free.
> ---
> mm/page_alloc.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0745aedebb37..3bb3484563ed 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5631,9 +5631,12 @@ EXPORT_SYMBOL(get_zeroed_page);
> */
> void __free_pages(struct page *page, unsigned int order)
> {
> + /* get PageHead before we drop reference */
> + int head = PageHead(page);
> +
> if (put_page_testzero(page))
> free_the_page(page, order);
> - else if (!PageHead(page))
> + else if (!head)
> while (order-- > 0)
> free_the_page(page + (1 << order), order);
> }
> --
> 2.22.3
On 2/9/23 18:48, David Chen wrote:
> When we upgraded our kernel, we started seeing some page corruption like
> the following consistently:
>
> BUG: Bad page state in process ganesha.nfsd pfn:1304ca
> page:0000000022261c55 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0 pfn:0x1304ca
> flags: 0x17ffffc0000000()
> raw: 0017ffffc0000000 ffff8a513ffd4c98 ffffeee24b35ec08 0000000000000000
> raw: 0000000000000000 0000000000000001 00000000ffffff7f 0000000000000000
> page dumped because: nonzero mapcount
> CPU: 0 PID: 15567 Comm: ganesha.nfsd Kdump: loaded Tainted: P B O 5.10.158-1.nutanix.20221209.el7.x86_64 #1
> Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
> Call Trace:
> dump_stack+0x74/0x96
> bad_page.cold+0x63/0x94
> check_new_page_bad+0x6d/0x80
> rmqueue+0x46e/0x970
> get_page_from_freelist+0xcb/0x3f0
> ? _cond_resched+0x19/0x40
> __alloc_pages_nodemask+0x164/0x300
> alloc_pages_current+0x87/0xf0
> skb_page_frag_refill+0x84/0x110
> ...
>
> Sometimes, it would also show up as corruption in the free list pointer and
> cause crashes.
>
> After bisecting the issue, we found the issue started from e320d3012d25:
>
> if (put_page_testzero(page))
> free_the_page(page, order);
> else if (!PageHead(page))
> while (order-- > 0)
> free_the_page(page + (1 << order), order);
>
> So the problem is the check PageHead is racy because at this point we
> already dropped our reference to the page. So even if we came in with
> compound page, the page can already be freed and PageHead can return
> false and we will end up freeing all the tail pages causing double free.
>
> Fixes: e320d3012d25 ("mm/page_alloc.c: fix freeing non-compound pages")
> Cc: Andrew Morton <[email protected]>
> Cc: Matthew Wilcox (Oracle) <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Chunwei Chen <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
That's nasty enough to go into 6.2, IMHO.
> ---
> mm/page_alloc.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0745aedebb37..3bb3484563ed 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5631,9 +5631,12 @@ EXPORT_SYMBOL(get_zeroed_page);
> */
> void __free_pages(struct page *page, unsigned int order)
> {
> + /* get PageHead before we drop reference */
> + int head = PageHead(page);
> +
> if (put_page_testzero(page))
> free_the_page(page, order);
> - else if (!PageHead(page))
> + else if (!head)
> while (order-- > 0)
> free_the_page(page + (1 << order), order);
> }
[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few templates
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]
On 09.02.23 18:48, David Chen wrote:
> When we upgraded our kernel, we started seeing some page corruption like
> the following consistently:
>
> BUG: Bad page state in process ganesha.nfsd pfn:1304ca
> page:0000000022261c55 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0 pfn:0x1304ca
> flags: 0x17ffffc0000000()
> raw: 0017ffffc0000000 ffff8a513ffd4c98 ffffeee24b35ec08 0000000000000000
> raw: 0000000000000000 0000000000000001 00000000ffffff7f 0000000000000000
> page dumped because: nonzero mapcount
> CPU: 0 PID: 15567 Comm: ganesha.nfsd Kdump: loaded Tainted: P B O 5.10.158-1.nutanix.20221209.el7.x86_64 #1
> Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
> Call Trace:
> dump_stack+0x74/0x96
> bad_page.cold+0x63/0x94
> check_new_page_bad+0x6d/0x80
> rmqueue+0x46e/0x970
> get_page_from_freelist+0xcb/0x3f0
> ? _cond_resched+0x19/0x40
> __alloc_pages_nodemask+0x164/0x300
> alloc_pages_current+0x87/0xf0
> skb_page_frag_refill+0x84/0x110
> ...
>
> Sometimes, it would also show up as corruption in the free list pointer and
> cause crashes.
>
> After bisecting the issue, we found the issue started from e320d3012d25:
>
> if (put_page_testzero(page))
> free_the_page(page, order);
> else if (!PageHead(page))
> while (order-- > 0)
> free_the_page(page + (1 << order), order);
>
> So the problem is the check PageHead is racy because at this point we
> already dropped our reference to the page. So even if we came in with
> compound page, the page can already be freed and PageHead can return
> false and we will end up freeing all the tail pages causing double free.
>
> Fixes: e320d3012d25 ("mm/page_alloc.c: fix freeing non-compound pages")
> Cc: Andrew Morton <[email protected]>
> Cc: Matthew Wilcox (Oracle) <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Chunwei Chen <[email protected]>
Thanks for the report and the patch. To be sure the issue doesn't fall
through the cracks unnoticed, I'm adding it to regzbot, the Linux kernel
regression tracking bot:
#regzbot ^introduced e320d3012d25
#regzbot title mm: page corruption caused by racy check in __free_pages
#regzbot ignore-activity
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.