2021-02-23 15:04:00

by Gerald Schaefer

Subject: [RFC] linux-next panic in hugepage_subpool_put_pages()

Hi,

LTP triggered a panic on s390 in hugepage_subpool_put_pages() with
linux-next 5.12.0-20210222, see below.

It crashes on the spin_lock(&spool->lock) at the beginning, because the
passed-in *spool points to 0000004e00000000, which is not addressable
memory. It rather looks like some flags and not a proper address. I suspect
some relation to the recent rework in that area, e.g. commit f1280272ae4d
("hugetlb: use page.private for hugetlb specific page flags").

__free_huge_page() calls hugepage_subpool_put_pages() and takes *spool from
hugetlb_page_subpool(page), which was changed by that commit to use
page[1]->private now.
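
For reference, the accessors introduced by that commit look roughly like
this (sketch from memory, details may differ slightly from the actual
code):

static inline struct hugepage_subpool *hugetlb_page_subpool(struct page *hpage)
{
	/* the subpool pointer now lives in page[1].private, not page->private */
	return (struct hugepage_subpool *)(hpage + 1)->private;
}

static inline void hugetlb_set_page_subpool(struct page *hpage,
					    struct hugepage_subpool *subpool)
{
	set_page_private(hpage + 1, (unsigned long)subpool);
}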

What I do not understand is how __free_huge_page() would be called at all
in the call trace below (set_max_huge_pages -> alloc_pool_huge_page ->
__free_huge_page -> hugepage_subpool_put_pages). From the code it seems
that this should not be possible, so I must be missing something.

BTW, the workload of the failing LTP test futex_wake04 is not really
related; the crash already happens during environment setup when it
writes to /proc/sys/vm/nr_hugepages, before the actual workload starts.
It is very hard to reproduce though, so I could not do a proper bisect
so far. Running futex_wake04 alone never triggered it either; I only hit
it very rarely when it was run with ./runltp -f syscalls. Looks like
some kind of race.

[ 4013.255629] LTP: starting futex_wake04
[ 4013.299623] futex_wake04 (865156): drop_caches: 3
[ 4013.300137] Unable to handle kernel pointer dereference in virtual kernel address space
[ 4013.300140] Failing address: 0000004e00000000 TEID: 0000004e00000403
[ 4013.300143] Fault in home space mode while using kernel ASCE.
[ 4013.300148] AS:000000014e264007 R3:0000000000000024
[ 4013.300172] Oops: 003b ilc:2 [#1] SMP
[ 4013.300177] Modules linked in: quota_v2 quota_tree tun overlay ntfs exfat vfat fat loop sctp udp_tunnel ip6_udp_tunnel vfio_pci irqbypass vfio_virqfd dm_service_time scsi_debug vfio_ap kvm vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nft_compat nf_nat_tftp nft_objref nf_conntrack_tftp nft_counter bridge stp llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua vfio_ccw vfio_mdev mdev vfio_iommu_type1 vfio eadm_sch sch_fq_codel configfs ip_tables x_tables ghash_s390 prng aes_s390 des_s390 libdes sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4 [last unloaded: init_module]
[ 4013.300273] CPU: 4 PID: 865156 Comm: futex_wake04 Tainted: G OE 5.12.0-20210222.rc0.git0.abaf6f60176f.300.fc33.s390x+next #1
[ 4013.300278] Hardware name: IBM 2827 H43 738 (LPAR)
[ 4013.300280] Krnl PSW : 0704c00180000000 000000014d00c36c (hugepage_subpool_put_pages.part.0+0x2c/0x138)
[ 4013.300302] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[ 4013.300307] Krnl GPRS: 0000000000000000 0000000000000000 0000004e00000000 0000000000000005
[ 4013.300311] 0000000000001000 0000037203deffc0 00000380092a7b80 0000000100000000
[ 4013.300314] 3ffff00000010000 000000014e2db208 0000000000000001 0000004e00000000
[ 4013.300317] 0000000081b81f00 000000014d88fde0 00000380092a7a30 00000380092a79e8
[ 4013.300327] Krnl Code: 000000014d00c35e: e3e0f0980024 stg %r14,152(%r15)
000000014d00c364: a7180000 lhi %r1,0
#000000014d00c368: 583003ac l %r3,940
>000000014d00c36c: ba132000 cs %r1,%r3,0(%r2)
000000014d00c370: a7740076 brc 7,000000014d00c45c
000000014d00c374: e310b0100004 lg %r1,16(%r11)
000000014d00c37a: a71fffff cghi %r1,-1
000000014d00c37e: a784000a brc 8,000000014d00c392
[ 4013.300351] Call Trace:
[ 4013.300353] [<000000014d00c36c>] hugepage_subpool_put_pages.part.0+0x2c/0x138
[ 4013.300358] [<000000014d00c546>] __free_huge_page+0xce/0x310
[ 4013.300361] [<000000014d00a45a>] alloc_pool_huge_page+0x102/0x120
[ 4013.300365] [<000000014d00b2ae>] set_max_huge_pages+0x13e/0x350
[ 4013.300368] [<000000014d00b728>] hugetlb_sysctl_handler_common+0xd8/0x110
[ 4013.300372] [<000000014d00dc20>] hugetlb_sysctl_handler+0x48/0x58
[ 4013.300376] [<000000014d12bd08>] proc_sys_call_handler+0x138/0x238
[ 4013.300381] [<000000014d058ade>] new_sync_write+0x10e/0x198
[ 4013.300386] [<000000014d059644>] vfs_write.part.0+0x12c/0x238
[ 4013.300390] [<000000014d05b8b8>] ksys_write+0x68/0xf8
[ 4013.300394] [<000000014cd71eda>] do_syscall+0x82/0xd0
[ 4013.300399] [<000000014d8671bc>] __do_syscall+0xb4/0xc8
[ 4013.300404] [<000000014d87304a>] system_call+0x72/0x98
[ 4013.300409] Last Breaking-Event-Address:
[ 4013.300410] [<040000008a220f00>] 0x40000008a220f00


2021-02-23 16:49:29

by Gerald Schaefer

Subject: Re: [RFC] linux-next panic in hugepage_subpool_put_pages()

On Tue, 23 Feb 2021 15:57:40 +0100
Gerald Schaefer <[email protected]> wrote:

[...]
> What I do not understand is how __free_huge_page() would be called at all
> in the call trace below (set_max_huge_pages -> alloc_pool_huge_page ->
> __free_huge_page -> hugepage_subpool_put_pages). From the code it seems
> that this should not be possible, so I must be missing something.

Ok, looking more closely at the dump and code, I see that __free_huge_page()
was called via alloc_pool_huge_page -> put_page() -> destroy_compound_page()
-> compound_page_dtors[2].

It doesn't feel right that alloc_pool_huge_page() ends up freeing the
newly allocated page again. Maybe some refcounting race, so that put_page()
wrongly assumes it was the last reference?
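
If I read mm/page_alloc.c correctly, index 2 in compound_page_dtors[] is
the hugetlb destructor, i.e. something like this (sketch, the table in
linux-next may have additional entries):

compound_page_dtor * const compound_page_dtors[NR_COMPOUND_DTORS] = {
	[NULL_COMPOUND_DTOR] = NULL,
	[COMPOUND_PAGE_DTOR] = free_compound_page,
#ifdef CONFIG_HUGETLB_PAGE
	[HUGETLB_PAGE_DTOR] = free_huge_page,		/* index 2 */
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	[TRANSHUGE_PAGE_DTOR] = free_transhuge_page,
#endif
};

That would also match compound_dtor = 0x2 in the tail page dump below, so
put_page() on the freshly allocated compound page went straight to
free_huge_page() -> __free_huge_page().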

Also from the dump, I could reconstruct the (head) struct page pointer
that __free_huge_page() was called with. Here is the content of the
head and the first tail page, maybe it can help. page->private in the tail
page was zeroed already in __free_huge_page(), but its original value was
the broken *spool pointer 0000004e00000000, as seen in the register output
of the backtrace.

crash> struct -x page 0000037203dec000
struct page {
flags = 0x3ffff00000010000,
{
{
lru = {
next = 0x37203dec008,
prev = 0x37203dec008
},
mapping = 0x0,
index = 0x0,
private = 0x0
},
{
dma_addr = 0x37203dec008
},
{
{
slab_list = {
next = 0x37203dec008,
prev = 0x37203dec008
},
{
next = 0x37203dec008,
pages = 0x372,
pobjects = 0x3dec008
}
},
slab_cache = 0x0,
freelist = 0x0,
{
s_mem = 0x0,
counters = 0x0,
{
inuse = 0x0,
objects = 0x0,
frozen = 0x0
}
}
},
{
compound_head = 0x37203dec008,
compound_dtor = 0x0,
compound_order = 0x0,
compound_mapcount = {
counter = 0x3dec008
},
compound_nr = 0x0
},
{
_compound_pad_1 = 0x37203dec008,
hpage_pinned_refcount = {
counter = 0x372
},
deferred_list = {
next = 0x0,
prev = 0x0
}
},
{
_pt_pad_1 = 0x37203dec008,
pmd_huge_pte = 0x37203dec008,
_pt_pad_2 = 0x0,
{
pt_mm = 0x0,
pt_frag_refcount = {
counter = 0x0
}
},
ptl = {
{
rlock = {
raw_lock = {
lock = 0x0
}
}
}
}
},
{
pgmap = 0x37203dec008,
zone_device_data = 0x37203dec008
},
callback_head = {
next = 0x37203dec008,
func = 0x37203dec008
}
},
{
_mapcount = {
counter = 0xffffffff
},
page_type = 0xffffffff,
active = 0xffffffff,
units = 0xffffffff
},
_refcount = {
counter = 0x0
},
memcg_data = 0x0
}


crash> struct -x page 0000037203dec040
struct page {
flags = 0x3ffff00000000000,
{
{
lru = {
next = 0x37203dec001,
prev = 0x2080372ffffffff
},
mapping = 0x10000000400,
index = 0x2,
private = 0x0
},
{
dma_addr = 0x37203dec001
},
{
{
slab_list = {
next = 0x37203dec001,
prev = 0x2080372ffffffff
},
{
next = 0x37203dec001,
pages = 0x2080372,
pobjects = 0xffffffff
}
},
slab_cache = 0x10000000400,
freelist = 0x2,
{
s_mem = 0x0,
counters = 0x0,
{
inuse = 0x0,
objects = 0x0,
frozen = 0x0
}
}
},
{
compound_head = 0x37203dec001,
compound_dtor = 0x2,
compound_order = 0x8,
compound_mapcount = {
counter = 0xffffffff
},
compound_nr = 0x100
},
{
_compound_pad_1 = 0x37203dec001,
hpage_pinned_refcount = {
counter = 0x2080372
},
deferred_list = {
next = 0x10000000400,
prev = 0x2
}
},
{
_pt_pad_1 = 0x37203dec001,
pmd_huge_pte = 0x2080372ffffffff,
_pt_pad_2 = 0x10000000400,
{
pt_mm = 0x2,
pt_frag_refcount = {
counter = 0x0
}
},
ptl = {
{
rlock = {
raw_lock = {
lock = 0x0
}
}
}
}
},
{
pgmap = 0x37203dec001,
zone_device_data = 0x2080372ffffffff
},
callback_head = {
next = 0x37203dec001,
func = 0x2080372ffffffff
}
},
{
_mapcount = {
counter = 0xffffffff
},
page_type = 0xffffffff,
active = 0xffffffff,
units = 0xffffffff
},
_refcount = {
counter = 0x0
},
memcg_data = 0x0
}

2021-02-23 20:38:25

by Mike Kravetz

Subject: Re: [RFC] linux-next panic in hugepage_subpool_put_pages()

On 2/23/21 6:57 AM, Gerald Schaefer wrote:
> Hi,
>
> LTP triggered a panic on s390 in hugepage_subpool_put_pages() with
> linux-next 5.12.0-20210222, see below.
>
> It crashes on the spin_lock(&spool->lock) at the beginning, because the
> passed-in *spool points to 0000004e00000000, which is not addressable
> memory. It rather looks like some flags and not a proper address. I suspect
> some relation to the recent rework in that area, e.g. commit f1280272ae4d
> ("hugetlb: use page.private for hugetlb specific page flags").
>
> __free_huge_page() calls hugepage_subpool_put_pages() and takes *spool from
> hugetlb_page_subpool(page), which was changed by that commit to use
> page[1]->private now.
>

Thanks Gerald,

Yes, I believe f1280272ae4d is the root cause of this issue. In that
commit, the subpool pointer was moved from page->private of the head
page to page->private of the first subpage. The page allocator will
initialize (zero) the private field of the head page, but not that of
subpages. So, that bad subpool pointer is likely an old page->private
value for the page.

That strange call path from set_max_huge_pages to __free_huge_page is
actually how the code puts newly allocated pages on its internal free
list.

I will do a bit more verification and put together a patch (it should
be simple).
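
Most likely it just needs to clear the stale pointer when a new hugetlb
page is prepped, roughly like this (untested sketch based on my reading of
prep_new_huge_page(), the surrounding lines may differ in linux-next):

static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
{
	INIT_LIST_HEAD(&page->lru);
	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
	hugetlb_set_page_subpool(page, NULL);	/* clear stale page[1].private */
	set_hugetlb_cgroup(page, NULL);
	set_hugetlb_cgroup_rsvd(page, NULL);
	spin_lock(&hugetlb_lock);
	h->nr_huge_pages++;
	h->nr_huge_pages_node[nid]++;
	spin_unlock(&hugetlb_lock);
}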
--
Mike Kravetz

2021-02-24 01:39:52

by Mike Kravetz

Subject: Re: [RFC] linux-next panic in hugepage_subpool_put_pages()

On 2/23/21 3:58 PM, Andrew Morton wrote:
> On Tue, 23 Feb 2021 10:06:12 -0800 Mike Kravetz <[email protected]> wrote:
>
>> On 2/23/21 6:57 AM, Gerald Schaefer wrote:
>>> Hi,
>>>
>>> LTP triggered a panic on s390 in hugepage_subpool_put_pages() with
>>> linux-next 5.12.0-20210222, see below.
>>>
>>> It crashes on the spin_lock(&spool->lock) at the beginning, because the
>>> passed-in *spool points to 0000004e00000000, which is not addressable
>>> memory. It rather looks like some flags and not a proper address. I suspect
>>> some relation to the recent rework in that area, e.g. commit f1280272ae4d
>>> ("hugetlb: use page.private for hugetlb specific page flags").
>>>
>>> __free_huge_page() calls hugepage_subpool_put_pages() and takes *spool from
>>> hugetlb_page_subpool(page), which was changed by that commit to use
>>> page[1]->private now.
>>>
>>
>> Thanks Gerald,
>>
>> Yes, I believe f1280272ae4d is the root cause of this issue. In that
>> commit, the subpool pointer was moved from page->private of the head
>> page to page->private of the first subpage. The page allocator will
>> initialize (zero) the private field of the head page, but not that of
>> subpages. So, that bad subpool pointer is likely an old page->private
>> value for the page.
>>
>> That strange call path from set_max_huge_pages to __free_huge_page is
>> actually how the code puts newly allocated pages on its internal free
>> list.
>>
>> I will do a bit more verification and put together a patch (it should
>> be simple).
>
> There's also Michal's documentation request:
> https://lkml.kernel.org/r/[email protected]
>

Thanks Andrew, I forgot about that.

It looks like the patch which added the synchronization documentation requested
by Michal may not have been picked up.
https://lore.kernel.org/linux-mm/[email protected]/

If you still need that patch, I could redo it and add the page[1].private
documentation requested here. Just let me know what is easiest for
you.
--
Mike Kravetz

2021-02-24 02:12:57

by Andrew Morton

Subject: Re: [RFC] linux-next panic in hugepage_subpool_put_pages()

On Tue, 23 Feb 2021 17:29:48 -0800 Mike Kravetz <[email protected]> wrote:

> >> I will do a bit more verification and put together a patch (it should
> >> be simple).
> >
> > There's also Michal's documentation request:
> > https://lkml.kernel.org/r/[email protected]
> >
>
> Thanks Andrew, I forgot about that.
>
> It looks like the patch which added the synchronization documentation requested
> by Michal may not have been picked up.
> https://lore.kernel.org/linux-mm/[email protected]/

OK, I grabbed that.

> If you still need that patch, I could redo it and add the page[1].private
> documentation requested here. Just let me know what is easiest for
> you.

Please send along an additional patch?

2021-02-24 05:39:33

by Andrew Morton

Subject: Re: [RFC] linux-next panic in hugepage_subpool_put_pages()

On Tue, 23 Feb 2021 10:06:12 -0800 Mike Kravetz <[email protected]> wrote:

> On 2/23/21 6:57 AM, Gerald Schaefer wrote:
> > Hi,
> >
> > LTP triggered a panic on s390 in hugepage_subpool_put_pages() with
> > linux-next 5.12.0-20210222, see below.
> >
> > It crashes on the spin_lock(&spool->lock) at the beginning, because the
> > passed-in *spool points to 0000004e00000000, which is not addressable
> > memory. It rather looks like some flags and not a proper address. I suspect
> > some relation to the recent rework in that area, e.g. commit f1280272ae4d
> > ("hugetlb: use page.private for hugetlb specific page flags").
> >
> > __free_huge_page() calls hugepage_subpool_put_pages() and takes *spool from
> > hugetlb_page_subpool(page), which was changed by that commit to use
> > page[1]->private now.
> >
>
> Thanks Gerald,
>
> Yes, I believe f1280272ae4d is the root cause of this issue. In that
> commit, the subpool pointer was moved from page->private of the head
> page to page->private of the first subpage. The page allocator will
> initialize (zero) the private field of the head page, but not that of
> subpages. So, that bad subpool pointer is likely an old page->private
> value for the page.
>
> That strange call path from set_max_huge_pages to __free_huge_page is
> actually how the code puts newly allocated pages on its internal free
> list.
>
> I will do a bit more verification and put together a patch (it should
> be simple).

There's also Michal's documentation request:
https://lkml.kernel.org/r/[email protected]

2021-02-24 12:43:16

by Mike Kravetz

Subject: [PATCH] hugetlb: document the new location of page subpool pointer

Expand comments, no functional change.

Signed-off-by: Mike Kravetz <[email protected]>
---
include/linux/hugetlb.h | 3 +++
1 file changed, 3 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index cccd1aab69dd..c0467a7a1fe0 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -476,6 +476,9 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
* huegtlb page specific state flags. These flags are located in page.private
* of the hugetlb head page. Functions created via the below macros should be
* used to manipulate these flags.
+ * Note: The hugetlb page subpool pointer previously located at page.private
+ * was moved to page[1].private to make room for flags in the head page. See
+ * hugetlb_page_subpool().
*
* HPG_restore_reserve - Set when a hugetlb page consumes a reservation at
* allocation time. Cleared when page is fully instantiated. Free
--
2.29.2

2021-02-24 13:06:33

by Muchun Song

Subject: Re: [External] [PATCH] hugetlb: document the new location of page subpool pointer

On Wed, Feb 24, 2021 at 12:04 PM Mike Kravetz <[email protected]> wrote:
>
> Expand comments, no functional change.
>
> Signed-off-by: Mike Kravetz <[email protected]>

Reviewed-by: Muchun Song <[email protected]>

Thanks

> ---
> include/linux/hugetlb.h | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index cccd1aab69dd..c0467a7a1fe0 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -476,6 +476,9 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
> * huegtlb page specific state flags. These flags are located in page.private
> * of the hugetlb head page. Functions created via the below macros should be
> * used to manipulate these flags.
> + * Note: The hugetlb page subpool pointer previously located at page.private
> + * was moved to page[1].private to make room for flags in the head page. See
> + * hugetlb_page_subpool().
> *
> * HPG_restore_reserve - Set when a hugetlb page consumes a reservation at
> * allocation time. Cleared when page is fully instantiated. Free
> --
> 2.29.2
>

2021-02-24 13:10:05

by David Hildenbrand

Subject: Re: [PATCH] hugetlb: document the new location of page subpool pointer

On 24.02.21 05:04, Mike Kravetz wrote:
> Expand comments, no functional change.
>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> include/linux/hugetlb.h | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index cccd1aab69dd..c0467a7a1fe0 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -476,6 +476,9 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
> * huegtlb page specific state flags. These flags are located in page.private
> * of the hugetlb head page. Functions created via the below macros should be
> * used to manipulate these flags.
> + * Note: The hugetlb page subpool pointer previously located at page.private
> + * was moved to page[1].private to make room for flags in the head page. See
> + * hugetlb_page_subpool().
> *
> * HPG_restore_reserve - Set when a hugetlb page consumes a reservation at
> * allocation time. Cleared when page is fully instantiated. Free
>

Reviewed-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb