2022-02-05 04:32:00

by Song Liu

Subject: [PATCH v9 bpf-next 1/9] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP

From: Song Liu <[email protected]>

This enables module_alloc() to allocate huge pages for 2MB+ requests.
To check the effect of this change, we need to enable
CONFIG_PTDUMP_DEBUGFS and call module_alloc(2MB). Before the change,
/sys/kernel/debug/page_tables/kernel shows a pte for this mapping. With
the change, it shows a pmd for this mapping.

Signed-off-by: Song Liu <[email protected]>
---
arch/x86/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6fddb63271d9..e0e0d00cf103 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -159,6 +159,7 @@ config X86
 	select HAVE_ALIGNED_STRUCT_PAGE		if SLUB
 	select HAVE_ARCH_AUDITSYSCALL
 	select HAVE_ARCH_HUGE_VMAP		if X86_64 || X86_PAE
+	select HAVE_ARCH_HUGE_VMALLOC		if HAVE_ARCH_HUGE_VMAP
 	select HAVE_ARCH_JUMP_LABEL
 	select HAVE_ARCH_JUMP_LABEL_RELATIVE
 	select HAVE_ARCH_KASAN			if X86_64
--
2.30.2


2022-03-26 00:14:17

by Edgecombe, Rick P

Subject: Re: [PATCH v9 bpf-next 1/9] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP

On Fri, 2022-02-04 at 10:57 -0800, Song Liu wrote:
> From: Song Liu <[email protected]>
>
> This enables module_alloc() to allocate huge pages for 2MB+ requests.
> To check the effect of this change, we need to enable
> CONFIG_PTDUMP_DEBUGFS and call module_alloc(2MB). Before the change,
> /sys/kernel/debug/page_tables/kernel shows a pte for this mapping. With
> the change, it shows a pmd for this mapping.
>
> Signed-off-by: Song Liu <[email protected]>
> ---
> arch/x86/Kconfig | 1 +
> 1 file changed, 1 insertion(+)

Hi,

I just saw this upstream today. Glad to see this functionality, but I
think turning on huge vmalloc pages for x86 needs a bit more. I'll
describe a couple of possible failure modes that I haven't actually tested.

One problem is that the direct map permission reset part in vmalloc
assumes any special permissioned pages are mapped 4k on the direct map.
Otherwise the operation could fail to reset a page RW if a PTE page
allocation fails when it tries to split the page to toggle a 4k sized
region NP/P. If you are not familiar, x86 CPA generally leaves the
direct map page sizes mirroring the primary alias (vmalloc). So once
vmalloc has huge pages, the special permissioned direct map aliases
will have them too. This limitation of HAVE_ARCH_HUGE_VMALLOC is
actually hinted at in the Kconfig comments, but I guess it was not made
explicit that x86 has these properties.

I think to make the vmalloc resetting part safe:
1. set_direct_map_invalid/default() needs to support multiple pages
like this[0].
2. vm_remove_mappings() needs to call them with the correct page size
in the hpage case so they don't cause a split[1].
3. Then hibernate needs to be blocked during this operation so it
doesn’t encounter the now sometimes huge NP pages, which it can’t
handle. Not sure what the right way to do this is, but potentially like
in the diff below[1].

Another problem is that CPA will sometimes now split pages of vmalloc
mappings in cases where it sets a region of an allocation to a
different permission than the rest (for example regular modules calling
set_memory_x() on the text section). Before this change, these couldn’t
fail since the module space mapping would never require a split.
Modules don't check for failure there, so I think they would now
proceed to try to execute NX memory if the split failed. It could only
happen on allocation of especially large modules. Maybe it should just
be avoided for now by having regular module allocations pass
VM_NO_HUGE_VMAP on x86. And BPF could call __vmalloc_node_range()
directly to get 2MB vmallocs.

[0]
https://lore.kernel.org/lkml/[email protected]/#t

[1] Untested, but something like this possibly:
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 99e0f3e8d1a5..97c4ca3a29b1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -42,6 +42,7 @@
 #include <linux/sched/mm.h>
 #include <asm/tlbflush.h>
 #include <asm/shmparam.h>
+#include <linux/suspend.h>
 
 #include "internal.h"
 #include "pgalloc-track.h"
@@ -2241,7 +2242,7 @@ EXPORT_SYMBOL(vm_map_ram);
 
 static struct vm_struct *vmlist __initdata;
 
-static inline unsigned int vm_area_page_order(struct vm_struct *vm)
+static inline unsigned int vm_area_page_order(const struct vm_struct *vm)
 {
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
 	return vm->page_order;
@@ -2560,12 +2561,12 @@ struct vm_struct *remove_vm_area(const void *addr)
 static inline void set_area_direct_map(const struct vm_struct *area,
				       int (*set_direct_map)(struct page *page))
 {
+	unsigned int page_order = vm_area_page_order(area);
 	int i;
 
-	/* HUGE_VMALLOC passes small pages to set_direct_map */
-	for (i = 0; i < area->nr_pages; i++)
+	for (i = 0; i < area->nr_pages; i += 1U << page_order)
 		if (page_address(area->pages[i]))
-			set_direct_map(area->pages[i]);
+			set_direct_map(area->pages[i], 1U << page_order);
 }
 
 /* Handle removing and resetting vm mappings related to the vm_struct. */
@@ -2592,6 +2593,10 @@ static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages)
 		return;
 	}
 
+	/* Hibernate can't handle large NP pages */
+	if (page_order)
+		lock_system_sleep();
+
 	/*
 	 * If execution gets here, flush the vm mapping and reset the direct
 	 * map. Find the start and end range of the direct mappings to make sure
@@ -2617,6 +2622,9 @@ static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages)
 	set_area_direct_map(area, set_direct_map_invalid_noflush);
 	_vm_unmap_aliases(start, end, flush_dmap);
 	set_area_direct_map(area, set_direct_map_default_noflush);
+
+	if (page_order)
+		unlock_system_sleep();
 }
 
 static void __vunmap(const void *addr, int deallocate_pages)

2022-03-26 20:20:16

by Paul Menzel

Subject: BUG: Bad page state in process systemd-udevd (was: [PATCH v9 bpf-next 1/9] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP)

#regzbot introduced: fac54e2bfb5be2b0bbf115fe80d45f59fd773048
#regzbot title: BUG: Bad page state in process systemd-udevd


Dear Song,


On 04.02.22 at 19:57, Song Liu wrote:
> From: Song Liu <[email protected]>
>
> This enables module_alloc() to allocate huge pages for 2MB+ requests.
> To check the effect of this change, we need to enable
> CONFIG_PTDUMP_DEBUGFS and call module_alloc(2MB). Before the change,
> /sys/kernel/debug/page_tables/kernel shows a pte for this mapping. With
> the change, it shows a pmd for this mapping.
>
> Signed-off-by: Song Liu <[email protected]>
> ---
> arch/x86/Kconfig | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 6fddb63271d9..e0e0d00cf103 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -159,6 +159,7 @@ config X86
> select HAVE_ALIGNED_STRUCT_PAGE if SLUB
> select HAVE_ARCH_AUDITSYSCALL
> select HAVE_ARCH_HUGE_VMAP if X86_64 || X86_PAE
> + select HAVE_ARCH_HUGE_VMALLOC if HAVE_ARCH_HUGE_VMAP
> select HAVE_ARCH_JUMP_LABEL
> select HAVE_ARCH_JUMP_LABEL_RELATIVE
> select HAVE_ARCH_KASAN if X86_64

Testing Linus' current master branch, Linux logs critical messages like
the one below:

BUG: Bad page state in process systemd-udevd pfn:102e03

I bisected it to your commit fac54e2bfb5 (x86/Kconfig: select
HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP).


Kind regards,

Paul


Attachments:
linux-5.18-vmap-bad-page-state.txt (264.83 kB)

2022-03-27 22:41:30

by Paul Menzel

Subject: Re: BUG: Bad page state in process systemd-udevd (was: [PATCH v9 bpf-next 1/9] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP)

Dear Song,


On 26.03.22 at 19:46, Paul Menzel wrote:
> #regzbot introduced: fac54e2bfb5be2b0bbf115fe80d45f59fd773048
> #regzbot title: BUG: Bad page state in process systemd-udevd

> On 04.02.22 at 19:57, Song Liu wrote:
>> From: Song Liu <[email protected]>
>>
>> This enables module_alloc() to allocate huge pages for 2MB+ requests.
>> To check the effect of this change, we need to enable
>> CONFIG_PTDUMP_DEBUGFS and call module_alloc(2MB). Before the change,
>> /sys/kernel/debug/page_tables/kernel shows a pte for this mapping. With
>> the change, it shows a pmd for this mapping.
>>
>> Signed-off-by: Song Liu <[email protected]>
>> ---
>>   arch/x86/Kconfig | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 6fddb63271d9..e0e0d00cf103 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -159,6 +159,7 @@ config X86
>>       select HAVE_ALIGNED_STRUCT_PAGE        if SLUB
>>       select HAVE_ARCH_AUDITSYSCALL
>>       select HAVE_ARCH_HUGE_VMAP        if X86_64 || X86_PAE
>> +    select HAVE_ARCH_HUGE_VMALLOC        if HAVE_ARCH_HUGE_VMAP
>>       select HAVE_ARCH_JUMP_LABEL
>>       select HAVE_ARCH_JUMP_LABEL_RELATIVE
>>       select HAVE_ARCH_KASAN            if X86_64
>
> Testing Linus’ current master branch, Linux logs critical messages like
> below:
>
>     BUG: Bad page state in process systemd-udevd  pfn:102e03
>
> I bisected to your commit fac54e2bfb5 (x86/Kconfig: select
> HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP).

Sorry, I forgot to mention that this is a 32-bit (i686) userspace with
a 64-bit Linux kernel, so it might be the same issue as mentioned in
commit eed1fcee556f (x86: Disable HAVE_ARCH_HUGE_VMALLOC on 32-bit x86);
however, that commit did not fix the issue for a 64-bit Linux kernel
with 32-bit userspace.


Kind regards,

Paul

2022-03-28 11:24:58

by Paul Menzel

Subject: Re: BUG: Bad page state in process systemd-udevd (was: [PATCH v9 bpf-next 1/9] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP)

Dear Song,


On 28.03.22 at 08:37, Song Liu wrote:
> Thanks Paul for highlighting the issue.

Thank you for getting back to me so quickly.

> + Rick, who highlighted some potential issues with this. (also attached
> the stack trace).

I had already added him but forgot to mention it in the message. Sorry
about that.

>> On Mar 27, 2022, at 3:36 AM, Paul Menzel <[email protected]> wrote:

>> On 26.03.22 at 19:46, Paul Menzel wrote:
>>> #regzbot introduced: fac54e2bfb5be2b0bbf115fe80d45f59fd773048
>>> #regzbot title: BUG: Bad page state in process systemd-udevd
>>
>>> On 04.02.22 at 19:57, Song Liu wrote:
>>>> From: Song Liu <[email protected]>
>>>>
>>>> This enables module_alloc() to allocate huge pages for 2MB+ requests.
>>>> To check the effect of this change, we need to enable
>>>> CONFIG_PTDUMP_DEBUGFS and call module_alloc(2MB). Before the change,
>>>> /sys/kernel/debug/page_tables/kernel shows a pte for this mapping. With
>>>> the change, it shows a pmd for this mapping.
>>>>
>>>> Signed-off-by: Song Liu <[email protected]>
>>>> ---
>>>> arch/x86/Kconfig | 1 +
>>>> 1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>>> index 6fddb63271d9..e0e0d00cf103 100644
>>>> --- a/arch/x86/Kconfig
>>>> +++ b/arch/x86/Kconfig
>>>> @@ -159,6 +159,7 @@ config X86
>>>> select HAVE_ALIGNED_STRUCT_PAGE if SLUB
>>>> select HAVE_ARCH_AUDITSYSCALL
>>>> select HAVE_ARCH_HUGE_VMAP if X86_64 || X86_PAE
>>>> + select HAVE_ARCH_HUGE_VMALLOC if HAVE_ARCH_HUGE_VMAP
>>>> select HAVE_ARCH_JUMP_LABEL
>>>> select HAVE_ARCH_JUMP_LABEL_RELATIVE
>>>> select HAVE_ARCH_KASAN if X86_64
>>> Testing Linus’ current master branch, Linux logs critical messages like below:
>>> BUG: Bad page state in process systemd-udevd pfn:102e03
>>> I bisected to your commit fac54e2bfb5 (x86/Kconfig: select
>>> HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP).
>>
>> Sorry, I forgot to mention that this is a 32-bit (i686) userspace
>> with a 64-bit Linux kernel, so it might be the same issue as
>> mentioned in commit eed1fcee556f (x86: Disable
>> HAVE_ARCH_HUGE_VMALLOC on 32-bit x86); however, that commit did not
>> fix the issue for a 64-bit Linux kernel with 32-bit userspace.
>
> I will look more into this tomorrow. To clarify, what is the 32-bit
> user space that triggers this? Is it systemd-udevd? Is the systemd
> also i686?

Yes, everything – also systemd – is i686. You can build a 32-bit VM
image with grml-debootstrap [1]:

sudo DEBOOTSTRAP=mmdebstrap ~/src/grml-debootstrap/grml-debootstrap
--vm --vmfile --vmsize 3G --target /dev/shm/debian-32.img -r sid --arch
i686 --filesystem ext4

Then run that image with QEMU, passing the 64-bit Linux kernel directly
to QEMU with the switches `-kernel` and `-append`, or install the amd64
Linux kernel into the Debian VM image, e.g. the package created with
`make bindeb-pkg`, via `dpkg -i …`.


Kind regards,

Paul


[1]: https://github.com/grml/grml-debootstrap/

2022-03-28 21:16:40

by Song Liu

Subject: Re: BUG: Bad page state in process systemd-udevd (was: [PATCH v9 bpf-next 1/9] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP)



> On Mar 27, 2022, at 11:51 PM, Paul Menzel <[email protected]> wrote:
>
> Dear Song,
>
>
> On 28.03.22 at 08:37, Song Liu wrote:
>> Thanks Paul for highlighting the issue.
>
> Thank you for getting back to me so quickly.
>
>> + Rick, who highlighted some potential issues with this. (also attached
>> the stack trace).
>
> I already had added him, but forgot to document it in the message. Sorry for that.
>
>>> On Mar 27, 2022, at 3:36 AM, Paul Menzel <[email protected]> wrote:
>
>>> On 26.03.22 at 19:46, Paul Menzel wrote:
>>>> #regzbot introduced: fac54e2bfb5be2b0bbf115fe80d45f59fd773048
>>>> #regzbot title: BUG: Bad page state in process systemd-udevd
>>>
>>>> On 04.02.22 at 19:57, Song Liu wrote:
>>>>> From: Song Liu <[email protected]>
>>>>>
>>>>> This enables module_alloc() to allocate huge pages for 2MB+ requests.
>>>>> To check the effect of this change, we need to enable
>>>>> CONFIG_PTDUMP_DEBUGFS and call module_alloc(2MB). Before the change,
>>>>> /sys/kernel/debug/page_tables/kernel shows a pte for this mapping. With
>>>>> the change, it shows a pmd for this mapping.
>>>>>
>>>>> Signed-off-by: Song Liu <[email protected]>
>>>>> ---
>>>>> arch/x86/Kconfig | 1 +
>>>>> 1 file changed, 1 insertion(+)
>>>>>
>>>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>>>> index 6fddb63271d9..e0e0d00cf103 100644
>>>>> --- a/arch/x86/Kconfig
>>>>> +++ b/arch/x86/Kconfig
>>>>> @@ -159,6 +159,7 @@ config X86
>>>>> select HAVE_ALIGNED_STRUCT_PAGE if SLUB
>>>>> select HAVE_ARCH_AUDITSYSCALL
>>>>> select HAVE_ARCH_HUGE_VMAP if X86_64 || X86_PAE
>>>>> + select HAVE_ARCH_HUGE_VMALLOC if HAVE_ARCH_HUGE_VMAP
>>>>> select HAVE_ARCH_JUMP_LABEL
>>>>> select HAVE_ARCH_JUMP_LABEL_RELATIVE
>>>>> select HAVE_ARCH_KASAN if X86_64
>>>> Testing Linus’ current master branch, Linux logs critical messages like below:
>>>> BUG: Bad page state in process systemd-udevd pfn:102e03
>>>> I bisected to your commit fac54e2bfb5 (x86/Kconfig: select
>>>> HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP).
>>> Sorry, I forgot to mention that this is a 32-bit (i686) userspace
>>> with a 64-bit Linux kernel, so it might be the same issue as
>>> mentioned in commit eed1fcee556f (x86: Disable
>>> HAVE_ARCH_HUGE_VMALLOC on 32-bit x86); however, that commit did not
>>> fix the issue for a 64-bit Linux kernel with 32-bit userspace.
>> I will look more into this tomorrow. To clarify, what is the 32-bit
>> user space that triggers this? Is it systemd-udevd? Is the systemd
>> also i686?
>
> Yes, everything – also systemd – is i686. You can build a 32-bit VM image with grml-debootstrap [1]:
>
> sudo DEBOOTSTRAP=mmdebstrap ~/src/grml-debootstrap/grml-debootstrap --vm --vmfile --vmsize 3G --target /dev/shm/debian-32.img -r sid --arch i686 --filesystem ext4
>
> Then run that with QEMU, but pass the 64-bit Linux kernel to QEMU directly with the switches `-kernel` and `-append`, or install the amd64 Linux kernel into the Debian VM image or the package created with `make bindeb-pkg` with `dpkg -i …`.

Thanks for this information!

I tried the following, but couldn't reproduce the issue.

sudo ./grml-debootstrap --vm --vmfile --vmsize 3G --target ../debian-32.img -r sid --arch i386 --filesystem ext4

Note: s/i686/i386/. Also, I ran this on Fedora, so I didn't specify DEBOOTSTRAP.

Then I ran it with:

qemu-system-x86_64 \
-boot d ./debian-32.img -m 1024 -smp 4 \
-kernel ./bzImage \
-nographic -append 'root=/dev/sda1 ro console=ttyS0,115200'

The VM boots fine. The config being used is x86_64_defconfig +
CONFIG_DRM_FBDEV_EMULATION.

I wonder whether this is caused by a different config or a different image.
Could you please share your config?

Thanks,
Song

PS: I couldn't figure out the root password of the image; the --password
option of grml-debootstrap doesn't seem to work.

2022-03-28 21:40:40

by Edgecombe, Rick P

Subject: Re: BUG: Bad page state in process systemd-udevd (was: [PATCH v9 bpf-next 1/9] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP)

On Mon, 2022-03-28 at 06:37 +0000, Song Liu wrote:
> + Rick, who highlighted some potential issues with this. (also
> attached
> the stack trace).

Yeah, this looks like something else. Just a wild guess, but could some
vmalloc() caller be mucking with the struct page of the backing
allocation in a way that trips up compound pages?

2022-03-28 21:50:48

by Song Liu

Subject: Re: BUG: Bad page state in process systemd-udevd (was: [PATCH v9 bpf-next 1/9] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP)

Thanks Paul for highlighting the issue.

+ Rick, who highlighted some potential issues with this. (also attached
the stack trace).

> On Mar 27, 2022, at 3:36 AM, Paul Menzel <[email protected]> wrote:
>
> Dear Song,
>
>
> On 26.03.22 at 19:46, Paul Menzel wrote:
>> #regzbot introduced: fac54e2bfb5be2b0bbf115fe80d45f59fd773048
>> #regzbot title: BUG: Bad page state in process systemd-udevd
>
>> On 04.02.22 at 19:57, Song Liu wrote:
>>> From: Song Liu <[email protected]>
>>>
>>> This enables module_alloc() to allocate huge pages for 2MB+ requests.
>>> To check the effect of this change, we need to enable
>>> CONFIG_PTDUMP_DEBUGFS and call module_alloc(2MB). Before the change,
>>> /sys/kernel/debug/page_tables/kernel shows a pte for this mapping. With
>>> the change, it shows a pmd for this mapping.
>>>
>>> Signed-off-by: Song Liu <[email protected]>
>>> ---
>>> arch/x86/Kconfig | 1 +
>>> 1 file changed, 1 insertion(+)
>>>
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index 6fddb63271d9..e0e0d00cf103 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -159,6 +159,7 @@ config X86
>>> select HAVE_ALIGNED_STRUCT_PAGE if SLUB
>>> select HAVE_ARCH_AUDITSYSCALL
>>> select HAVE_ARCH_HUGE_VMAP if X86_64 || X86_PAE
>>> + select HAVE_ARCH_HUGE_VMALLOC if HAVE_ARCH_HUGE_VMAP
>>> select HAVE_ARCH_JUMP_LABEL
>>> select HAVE_ARCH_JUMP_LABEL_RELATIVE
>>> select HAVE_ARCH_KASAN if X86_64
>> Testing Linus’ current master branch, Linux logs critical messages like below:
>> BUG: Bad page state in process systemd-udevd pfn:102e03
>> I bisected to your commit fac54e2bfb5 (x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP).
>
> Sorry, I forgot to mention that this is a 32-bit (i686) userspace with a 64-bit Linux kernel, so it might be the same issue as mentioned in commit eed1fcee556f (x86: Disable HAVE_ARCH_HUGE_VMALLOC on 32-bit x86); however, that commit did not fix the issue for a 64-bit Linux kernel with 32-bit userspace.

I will look more into this tomorrow. To clarify, what is the 32-bit user
space that triggers this? Is it systemd-udevd? Is the systemd also i686?

Thanks,
Song


Attachments:
linux-5.18-vmap-bad-page-state.txt (269.08 kB)

2022-03-28 22:52:05

by Paul Menzel

Subject: Re: BUG: Bad page state in process systemd-udevd (was: [PATCH v9 bpf-next 1/9] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP)

Dear Song,


On 28.03.22 at 21:24, Song Liu wrote:

>> On Mar 27, 2022, at 11:51 PM, Paul Menzel <[email protected]> wrote:

>> On 28.03.22 at 08:37, Song Liu wrote:

[…]

>>>> On Mar 27, 2022, at 3:36 AM, Paul Menzel <[email protected]> wrote:
>>
>>>> On 26.03.22 at 19:46, Paul Menzel wrote:
>>>>> #regzbot introduced: fac54e2bfb5be2b0bbf115fe80d45f59fd773048
>>>>> #regzbot title: BUG: Bad page state in process systemd-udevd
>>>>
>>>>> On 04.02.22 at 19:57, Song Liu wrote:
>>>>>> From: Song Liu <[email protected]>
>>>>>>
>>>>>> This enables module_alloc() to allocate huge pages for 2MB+ requests.
>>>>>> To check the effect of this change, we need to enable
>>>>>> CONFIG_PTDUMP_DEBUGFS and call module_alloc(2MB). Before the change,
>>>>>> /sys/kernel/debug/page_tables/kernel shows a pte for this mapping. With
>>>>>> the change, it shows a pmd for this mapping.
>>>>>>
>>>>>> Signed-off-by: Song Liu <[email protected]>
>>>>>> ---
>>>>>> arch/x86/Kconfig | 1 +
>>>>>> 1 file changed, 1 insertion(+)
>>>>>>
>>>>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>>>>> index 6fddb63271d9..e0e0d00cf103 100644
>>>>>> --- a/arch/x86/Kconfig
>>>>>> +++ b/arch/x86/Kconfig
>>>>>> @@ -159,6 +159,7 @@ config X86
>>>>>> select HAVE_ALIGNED_STRUCT_PAGE if SLUB
>>>>>> select HAVE_ARCH_AUDITSYSCALL
>>>>>> select HAVE_ARCH_HUGE_VMAP if X86_64 || X86_PAE
>>>>>> + select HAVE_ARCH_HUGE_VMALLOC if HAVE_ARCH_HUGE_VMAP
>>>>>> select HAVE_ARCH_JUMP_LABEL
>>>>>> select HAVE_ARCH_JUMP_LABEL_RELATIVE
>>>>>> select HAVE_ARCH_KASAN if X86_64
>>>>> Testing Linus’ current master branch, Linux logs critical messages like below:
>>>>> BUG: Bad page state in process systemd-udevd pfn:102e03
>>>>> I bisected to your commit fac54e2bfb5 (x86/Kconfig: select
>>>>> HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP).
>>>> Sorry, I forgot to mention that this is a 32-bit (i686) userspace
>>>> with a 64-bit Linux kernel, so it might be the same issue as
>>>> mentioned in commit eed1fcee556f (x86: Disable
>>>> HAVE_ARCH_HUGE_VMALLOC on 32-bit x86); however, that commit did not
>>>> fix the issue for a 64-bit Linux kernel with 32-bit userspace.
>>> I will look more into this tomorrow. To clarify, what is the 32-bit
>>> user space that triggers this? Is it systemd-udevd? Is the systemd
>>> also i686?
>>
>> Yes, everything – also systemd – is i686. You can build a 32-bit VM image with grml-debootstrap [1]:
>>
>> sudo DEBOOTSTRAP=mmdebstrap ~/src/grml-debootstrap/grml-debootstrap --vm --vmfile --vmsize 3G --target /dev/shm/debian-32.img -r sid --arch i686 --filesystem ext4
>>
>> Then run that with QEMU, but pass the 64-bit Linux kernel to QEMU directly with the switches `-kernel` and `-append`, or install the amd64 Linux kernel into the Debian VM image or the package created with `make bindeb-pkg` with `dpkg -i …`.
>
> Thanks for these information!
>
> I tried the following, but couldn't reproduce the issue.
>
> sudo ./grml-debootstrap --vm --vmfile --vmsize 3G --target ../debian-32.img -r sid --arch i386 --filesystem ext4
>
> Note: s/i686/i386/. Also I run this on Fedora, so I didn't specify DEBOOTSTRAP.
>
> Then I run it with
>
> qemu-system-x86_64 \
> -boot d ./debian-32.img -m 1024 -smp 4 \
> -kernel ./bzImage \
> -nographic -append 'root=/dev/sda1 ro console=ttyS0,115200'
>
> The VM boots fine. The config being used is x86_64_defconfig +
> CONFIG_DRM_FBDEV_EMULATION.
>
> I wonder whether this is caused by different config or different image.
> Could you please share your config?

Sorry for leading you down the wrong path. I actually just wanted to
help you get a 32-bit userspace set up quickly. I haven't tried
reproducing the issue in a VM; I have used only the ASUS F2A85-M PRO.

Booting the system with `nomodeset`, I didn't see the error. No idea if
it's related to framebuffer handling or specific to the AMD graphics device.

> PS: I couldn't figure out the root password of the image, --password
> option of grml-debootstrap doesn't seem to work.

Hmm, I thought it asks you during the install, but I haven't done one in
a while.


Kind regards,

Paul

2022-03-28 23:26:51

by Song Liu

Subject: Re: BUG: Bad page state in process systemd-udevd (was: [PATCH v9 bpf-next 1/9] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP)



> On Mar 28, 2022, at 1:14 PM, Paul Menzel <[email protected]> wrote:
>
> Dear Song,
>
>
> On 28.03.22 at 21:24, Song Liu wrote:
>
>>> On Mar 27, 2022, at 11:51 PM, Paul Menzel <[email protected]> wrote:
>
>>> On 28.03.22 at 08:37, Song Liu wrote:
>
> […]
>
>>>>> On Mar 27, 2022, at 3:36 AM, Paul Menzel <[email protected]> wrote:
>>>
>>>>> On 26.03.22 at 19:46, Paul Menzel wrote:
>>>>>> #regzbot introduced: fac54e2bfb5be2b0bbf115fe80d45f59fd773048
>>>>>> #regzbot title: BUG: Bad page state in process systemd-udevd
>>>>>
>>>>>> On 04.02.22 at 19:57, Song Liu wrote:
>>>>>>> From: Song Liu <[email protected]>
>>>>>>>
>>>>>>> This enables module_alloc() to allocate huge pages for 2MB+ requests.
>>>>>>> To check the effect of this change, we need to enable
>>>>>>> CONFIG_PTDUMP_DEBUGFS and call module_alloc(2MB). Before the change,
>>>>>>> /sys/kernel/debug/page_tables/kernel shows a pte for this mapping. With
>>>>>>> the change, it shows a pmd for this mapping.
>>>>>>>
>>>>>>> Signed-off-by: Song Liu <[email protected]>
>>>>>>> ---
>>>>>>> arch/x86/Kconfig | 1 +
>>>>>>> 1 file changed, 1 insertion(+)
>>>>>>>
>>>>>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>>>>>> index 6fddb63271d9..e0e0d00cf103 100644
>>>>>>> --- a/arch/x86/Kconfig
>>>>>>> +++ b/arch/x86/Kconfig
>>>>>>> @@ -159,6 +159,7 @@ config X86
>>>>>>> select HAVE_ALIGNED_STRUCT_PAGE if SLUB
>>>>>>> select HAVE_ARCH_AUDITSYSCALL
>>>>>>> select HAVE_ARCH_HUGE_VMAP if X86_64 || X86_PAE
>>>>>>> + select HAVE_ARCH_HUGE_VMALLOC if HAVE_ARCH_HUGE_VMAP
>>>>>>> select HAVE_ARCH_JUMP_LABEL
>>>>>>> select HAVE_ARCH_JUMP_LABEL_RELATIVE
>>>>>>> select HAVE_ARCH_KASAN if X86_64
>>>>>> Testing Linus’ current master branch, Linux logs critical messages like below:
>>>>>> BUG: Bad page state in process systemd-udevd pfn:102e03
>>>>>> I bisected to your commit fac54e2bfb5 (x86/Kconfig: select
>>>>>> HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP).
>>>>> Sorry, I forgot to mention that this is a 32-bit (i686) userspace
>>>>> with a 64-bit Linux kernel, so it might be the same issue as
>>>>> mentioned in commit eed1fcee556f (x86: Disable
>>>>> HAVE_ARCH_HUGE_VMALLOC on 32-bit x86); however, that commit did not
>>>>> fix the issue for a 64-bit Linux kernel with 32-bit userspace.
>>>> I will look more into this tomorrow. To clarify, what is the 32-bit
>>>> user space that triggers this? Is it systemd-udevd? Is the systemd
>>>> also i686?
>>>
>>> Yes, everything – also systemd – is i686. You can build a 32-bit VM image with grml-debootstrap [1]:
>>>
>>> sudo DEBOOTSTRAP=mmdebstrap ~/src/grml-debootstrap/grml-debootstrap --vm --vmfile --vmsize 3G --target /dev/shm/debian-32.img -r sid --arch i686 --filesystem ext4
>>>
>>> Then run that with QEMU, but pass the 64-bit Linux kernel to QEMU directly with the switches `-kernel` and `-append`, or install the amd64 Linux kernel into the Debian VM image or the package created with `make bindeb-pkg` with `dpkg -i …`.
>> Thanks for these information!
>> I tried the following, but couldn't reproduce the issue.
>> sudo ./grml-debootstrap --vm --vmfile --vmsize 3G --target ../debian-32.img -r sid --arch i386 --filesystem ext4
>> Note: s/i686/i386/. Also I run this on Fedora, so I didn't specify DEBOOTSTRAP.
>> Then I run it with
>> qemu-system-x86_64 \
>> -boot d ./debian-32.img -m 1024 -smp 4 \
>> -kernel ./bzImage \
>> -nographic -append 'root=/dev/sda1 ro console=ttyS0,115200'
>> The VM boots fine. The config being used is x86_64_defconfig +
>> CONFIG_DRM_FBDEV_EMULATION.
>> I wonder whether this is caused by different config or different image.
>> Could you please share your config?
>
> Sorry, for leading you on the wrong path. I actually just wanted to help getting a 32-bit userspace set up quickly. I haven’t tried reproducing the issue in a VM, and used only the ASUS F2A85-M PRO.
>
> Booting the system with `nomodeset`, I didn’t see the error. No idea if it’s related to framebuffer handling or specific to AMD graphics device.

I guess this only happens with specific hardware and configurations.
Let me see what the best way is to avoid allocating huge pages in this
case.

Thanks,
Song

>
>> PS: I couldn't figure out the root password of the image, --password
>> option of grml-debootstrap doesn't seem to work.
>
> Hmm, I thought it’s asking you during install, but I haven’t done it in a while.
>
>
> Kind regards,
>
> Paul

2022-03-28 23:58:25

by Song Liu

Subject: Re: [PATCH v9 bpf-next 1/9] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP

+Paul

> On Mar 25, 2022, at 5:06 PM, Edgecombe, Rick P <[email protected]> wrote:
>
> On Fri, 2022-02-04 at 10:57 -0800, Song Liu wrote:
>> From: Song Liu <[email protected]>
>>
>> This enables module_alloc() to allocate huge pages for 2MB+ requests.
>> To check the effect of this change, we need to enable
>> CONFIG_PTDUMP_DEBUGFS and call module_alloc(2MB). Before the change,
>> /sys/kernel/debug/page_tables/kernel shows a pte for this mapping. With
>> the change, it shows a pmd for this mapping.
>>
>> Signed-off-by: Song Liu <[email protected]>
>> ---
>> arch/x86/Kconfig | 1 +
>> 1 file changed, 1 insertion(+)
>
> Hi,
>
> I just saw this upstream today. Glad to see this functionality, but I
> think turning on huge vmalloc pages for x86 needs a bit more. I’ll
> describe a couple of possible failure modes that I haven't actually tested.
>
> One problem is that the direct map permission reset part in vmalloc
> assumes any special permissioned pages are mapped 4k on the direct map.
> Otherwise the operation could fail to reset a page RW if a PTE page
> allocation fails when it tries to split the page to toggle a 4k sized
> region NP/P. If you are not familiar, x86 CPA generally leaves the
> direct map page sizes mirroring the primary alias (vmalloc). So once
> vmalloc has huge pages, the special permissioned direct map aliases
> will have them too. This limitation of HAVE_ARCH_HUGE_VMALLOC is
> actually hinted at in the Kconfig comments, but I guess it was not
> made explicit that x86 has these properties.
>
> I think to make the vmalloc resetting part safe:
> 1. set_direct_map_invalid/default() needs to support multiple pages
> like this[0].
> 2. vm_remove_mappings() needs to call them with the correct page size
> in the hpage case so they don't cause a split[1].
> 3. Then hibernate needs to be blocked during this operation so it
> doesn’t encounter the now sometimes huge NP pages, which it can’t
> handle. Not sure what the right way to do this is, but potentially like
> in the diff below[1].
>
> Another problem is that CPA will sometimes now split pages of vmalloc
> mappings in cases where it sets a region of an allocation to a
> different permission than the rest (for example regular modules calling
> set_memory_x() on the text section). Before this change, these couldn’t
> fail since the module space mapping would never require a split.
> Modules don't check for failure there, so I think they would now
> proceed to try to execute NX memory if the split failed. It could only
> happen on allocation of especially large modules. Maybe it should just
> be avoided for now by having regular module allocations pass
> VM_NO_HUGE_VMAP on x86. And BPF could call __vmalloc_node_range()
> directly to get 2MB vmallocs.

I like this direction, but I am afraid it is not enough. Using
VM_NO_HUGE_VMAP in module_alloc() will make sure we don't allocate
huge pages for modules, but other users of __vmalloc_node_range(),
such as the vzalloc() call in Paul's report, may still hit the issue.

Maybe we need another flag, VM_FORCE_HUGE_VMAP, that bypasses the
vmap_allow_huge check. Something like the diff below.

Would this work?

Thanks,
Song



diff --git i/include/linux/vmalloc.h w/include/linux/vmalloc.h
index 3b1df7da402d..a639405dab99 100644
--- i/include/linux/vmalloc.h
+++ w/include/linux/vmalloc.h
@@ -27,6 +27,7 @@ struct notifier_block; /* in notifier.h */
#define VM_FLUSH_RESET_PERMS 0x00000100 /* reset direct map and flush TLB on unmap, can't be freed in atomic context */
#define VM_MAP_PUT_PAGES 0x00000200 /* put pages and free array in vfree */
#define VM_NO_HUGE_VMAP 0x00000400 /* force PAGE_SIZE pte mapping */
+#define VM_FORCE_HUGE_VMAP 0x00000800 /* force PMD_SIZE mapping (bypass vmap_allow_huge check) */

#if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
!defined(CONFIG_KASAN_VMALLOC)
diff --git i/kernel/bpf/core.c w/kernel/bpf/core.c
index 13e9dbeeedf3..3cd0ff66d39c 100644
--- i/kernel/bpf/core.c
+++ w/kernel/bpf/core.c
@@ -851,13 +851,22 @@ static LIST_HEAD(pack_list);
#define BPF_HPAGE_MASK PAGE_MASK
#endif

+static void *bpf_prog_pack_vmalloc(unsigned long size)
+{
+ return __vmalloc_node_range(size, MODULE_ALIGN,
+ MODULES_VADDR + get_module_load_offset(),
+ MODULES_END, gfp_mask, PAGE_KERNEL,
+ VM_DEFER_KMEMLEAK | VM_FORCE_HUGE_VMAP,
+ NUMA_NO_NODE, __builtin_return_address(0));
+}
+
static size_t select_bpf_prog_pack_size(void)
{
size_t size;
void *ptr;

size = BPF_HPAGE_SIZE * num_online_nodes();
- ptr = module_alloc(size);
+ ptr = bpf_prog_pack_vmalloc(size);

/* Test whether we can get huge pages. If not just use PAGE_SIZE
* packs.
@@ -881,7 +890,7 @@ static struct bpf_prog_pack *alloc_new_pack(void)
GFP_KERNEL);
if (!pack)
return NULL;
- pack->ptr = module_alloc(bpf_prog_pack_size);
+ pack->ptr = bpf_prog_pack_vmalloc(bpf_prog_pack_size);
if (!pack->ptr) {
kfree(pack);
return NULL;
diff --git i/mm/vmalloc.c w/mm/vmalloc.c
index e163372d3967..df2dd6779fa8 100644
--- i/mm/vmalloc.c
+++ w/mm/vmalloc.c
@@ -3106,7 +3106,8 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
return NULL;
}

- if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) {
+ if ((vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) ||
+ (vm_flags & VM_FORCE_HUGE_VMAP)) {
unsigned long size_per_node;

/*



>
> [0]
> https://lore.kernel.org/lkml/[email protected]/#t
>
> [1] Untested, but something like this possibly:
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 99e0f3e8d1a5..97c4ca3a29b1 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -42,6 +42,7 @@
> #include <linux/sched/mm.h>
> #include <asm/tlbflush.h>
> #include <asm/shmparam.h>
> +#include <linux/suspend.h>
>
> #include "internal.h"
> #include "pgalloc-track.h"
> @@ -2241,7 +2242,7 @@ EXPORT_SYMBOL(vm_map_ram);
>
> static struct vm_struct *vmlist __initdata;
>
> -static inline unsigned int vm_area_page_order(struct vm_struct *vm)
> +static inline unsigned int vm_area_page_order(const struct vm_struct
> *vm)
> {
> #ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
> return vm->page_order;
> @@ -2560,12 +2561,12 @@ struct vm_struct *remove_vm_area(const void
> *addr)
> static inline void set_area_direct_map(const struct vm_struct *area,
> int (*set_direct_map)(struct
> page *page))
> {
> + unsigned int page_order = vm_area_page_order(area);
> int i;
>
> - /* HUGE_VMALLOC passes small pages to set_direct_map */
> - for (i = 0; i < area->nr_pages; i++)
> + for (i = 0; i < area->nr_pages; i += 1U << page_order)
> if (page_address(area->pages[i]))
> - set_direct_map(area->pages[i]);
> + set_direct_map(area->pages[i], 1U <<
> page_order);
> }
>
> /* Handle removing and resetting vm mappings related to the vm_struct.
> */
> @@ -2592,6 +2593,10 @@ static void vm_remove_mappings(struct vm_struct
> *area, int deallocate_pages)
> return;
> }
>
> + /* Hibernate can't handle large NP pages */
> + if (page_order)
> + lock_system_sleep();
> +
> /*
> * If execution gets here, flush the vm mapping and reset the
> direct
> * map. Find the start and end range of the direct mappings to
> make sure
> @@ -2617,6 +2622,9 @@ static void vm_remove_mappings(struct vm_struct
> *area, int deallocate_pages)
> set_area_direct_map(area, set_direct_map_invalid_noflush);
> _vm_unmap_aliases(start, end, flush_dmap);
> set_area_direct_map(area, set_direct_map_default_noflush);
> +
> + if (page_order)
> + unlock_system_sleep();
> }
>
> static void __vunmap(const void *addr, int deallocate_pages)

2022-03-29 00:37:31

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v9 bpf-next 1/9] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP

On Mon, 2022-03-28 at 23:27 +0000, Song Liu wrote:
> I like this direction. But I am afraid this is not enough. Using
> VM_NO_HUGE_VMAP in module_alloc() will make sure we don't allocate
> huge pages for modules. But other users of __vmalloc_node_range(),
> such as vzalloc in Paul's report, may still hit the issue.
>
> Maybe we need another flag VM_FORCE_HUGE_VMAP that bypasses
> vmap_allow_huge check. Something like the diff below.
>
> Would this work?

Yea, that looks like a safer direction. It's too bad we can't have
automatic large pages, but it doesn't seem ready to just turn on for
the whole x86 kernel.

I'm not sure about this implementation though. It would let large pages
get enabled without HAVE_ARCH_HUGE_VMALLOC and also despite the disable
kernel parameter.

Apparently some architectures can handle large pages automatically and
it has benefits for them, so maybe vmalloc should support both
behaviors based on config. Like there should be an
ARCH_HUGE_VMALLOC_REQUIRE_FLAG config. If configured, it requires
VM_HUGE_VMAP (or some name). I don't think FORCE fits, because the
current logic would not always give huge pages.

But yea, seems risky to leave it on generally, even if you could fix
Paul's specific issue.

2022-03-29 17:48:14

by Song Liu

[permalink] [raw]
Subject: Re: [PATCH v9 bpf-next 1/9] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP



> On Mar 28, 2022, at 5:18 PM, Edgecombe, Rick P <[email protected]> wrote:
>
> On Mon, 2022-03-28 at 23:27 +0000, Song Liu wrote:
>> I like this direction. But I am afraid this is not enough. Using
>> VM_NO_HUGE_VMAP in module_alloc() will make sure we don't allocate
>> huge pages for modules. But other users of __vmalloc_node_range(),
>> such as vzalloc in Paul's report, may still hit the issue.
>>
>> Maybe we need another flag VM_FORCE_HUGE_VMAP that bypasses
>> vmap_allow_huge check. Something like the diff below.
>>
>> Would this work?
>
> Yea, that looks like a safer direction. It's too bad we can't have
> automatic large pages, but it doesn't seem ready to just turn on for
> the whole x86 kernel.
>
> I'm not sure about this implementation though. It would let large pages
> get enabled without HAVE_ARCH_HUGE_VMALLOC and also despite the disable
> kernel parameter.
>
> Apparently some architectures can handle large pages automatically and
> it has benefits for them, so maybe vmalloc should support both
> behaviors based on config. Like there should be an
> ARCH_HUGE_VMALLOC_REQUIRE_FLAG config. If configured, it requires
> VM_HUGE_VMAP (or some name). I don't think FORCE fits, because the
> current logic would not always give huge pages.
>
> But yea, seems risky to leave it on generally, even if you could fix
> Paul's specific issue.
>


How about something like the following? (I still need to fix something, but
would like some feedbacks on the API).

Thanks,
Song


diff --git a/arch/Kconfig b/arch/Kconfig
index 84bc1de02720..defef50ff527 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -858,6 +858,15 @@ config HAVE_ARCH_HUGE_VMALLOC
depends on HAVE_ARCH_HUGE_VMAP
bool

+#
+# HAVE_ARCH_HUGE_VMALLOC_FLAG allows users of __vmalloc_node_range to allocate
+# huge pages without HAVE_ARCH_HUGE_VMALLOC. To allocate huge pages, the user
+# needs to call __vmalloc_node_range with VM_PREFER_HUGE_VMAP.
+#
+config HAVE_ARCH_HUGE_VMALLOC_FLAG
+ depends on HAVE_ARCH_HUGE_VMAP
+ bool
+
config ARCH_WANT_HUGE_PMD_SHARE
bool

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7340d9f01b62..e64f00415575 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -161,7 +161,7 @@ config X86
select HAVE_ALIGNED_STRUCT_PAGE if SLUB
select HAVE_ARCH_AUDITSYSCALL
select HAVE_ARCH_HUGE_VMAP if X86_64 || X86_PAE
- select HAVE_ARCH_HUGE_VMALLOC if X86_64
+ select HAVE_ARCH_HUGE_VMALLOC_FLAG if X86_64
select HAVE_ARCH_JUMP_LABEL
select HAVE_ARCH_JUMP_LABEL_RELATIVE
select HAVE_ARCH_KASAN if X86_64
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 3b1df7da402d..98f8a93fcfd4 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -35,6 +35,11 @@ struct notifier_block; /* in notifier.h */
#define VM_DEFER_KMEMLEAK 0
#endif

+#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG
+#define VM_PREFER_HUGE_VMAP 0x00001000 /* prefer PMD_SIZE mapping (bypass vmap_allow_huge check) */
+#else
+#define VM_PREFER_HUGE_VMAP 0
+#endif
/* bits [20..32] reserved for arch specific ioremap internals */

/*
@@ -51,7 +56,7 @@ struct vm_struct {
unsigned long size;
unsigned long flags;
struct page **pages;
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
unsigned int page_order;
#endif
unsigned int nr_pages;
@@ -225,7 +230,7 @@ static inline bool is_vm_area_hugepages(const void *addr)
* prevents that. This only indicates the size of the physical page
* allocated in the vmalloc layer.
*/
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
return find_vm_area(addr)->page_order > 0;
#else
return false;
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 13e9dbeeedf3..fc9dae095079 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -851,13 +851,28 @@ static LIST_HEAD(pack_list);
#define BPF_HPAGE_MASK PAGE_MASK
#endif

+static void *bpf_prog_pack_vmalloc(unsigned long size)
+{
+#if defined(MODULES_VADDR)
+ unsigned long start = MODULES_VADDR;
+ unsigned long end = MODULES_END;
+#else
+ unsigned long start = VMALLOC_START;
+ unsigned long end = VMALLOC_END;
+#endif
+
+ return __vmalloc_node_range(size, PAGE_SIZE, start, end, GFP_KERNEL, PAGE_KERNEL,
+ VM_DEFER_KMEMLEAK | VM_PREFER_HUGE_VMAP,
+ NUMA_NO_NODE, __builtin_return_address(0));
+}
+
static size_t select_bpf_prog_pack_size(void)
{
size_t size;
void *ptr;

size = BPF_HPAGE_SIZE * num_online_nodes();
- ptr = module_alloc(size);
+ ptr = bpf_prog_pack_vmalloc(size);

/* Test whether we can get huge pages. If not just use PAGE_SIZE
* packs.
@@ -881,7 +896,7 @@ static struct bpf_prog_pack *alloc_new_pack(void)
GFP_KERNEL);
if (!pack)
return NULL;
- pack->ptr = module_alloc(bpf_prog_pack_size);
+ pack->ptr = bpf_prog_pack_vmalloc(bpf_prog_pack_size);
if (!pack->ptr) {
kfree(pack);
return NULL;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e163372d3967..9d3c1ab8bdf1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2261,7 +2261,7 @@ static inline unsigned int vm_area_page_order(struct vm_struct *vm)

static inline void set_vm_area_page_order(struct vm_struct *vm, unsigned int order)
{
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
vm->page_order = order;
#else
BUG_ON(order != 0);
@@ -3106,7 +3106,8 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
return NULL;
}

- if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) {
+ if ((vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) ||
+ (IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG) && (vm_flags & VM_PREFER_HUGE_VMAP))) {
unsigned long size_per_node;

/*

2022-03-30 09:52:04

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v9 bpf-next 1/9] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP

On Tue, 2022-03-29 at 19:13 +0000, Song Liu wrote:
> >
> > I don't think we should tie this to vmap_allow_huge by definition.
> > Also, what it does is try 2MB pages for allocations larger than
> > 2MB.
> > For smaller allocations it doesn't try or "prefer" them.
>
> How about something like:
>
> #define VM_TRY_HUGE_VMAP 0x00001000 /* try PMD_SIZE
> mapping if size-per-node >= PMD_SIZE */

Seems reasonable name. I don't know if "size-per-node >= PMD_SIZE" is
going to be useful information. Maybe something like:

/* Allow for huge pages on HAVE_ARCH_HUGE_VMALLOC_FLAG arch's */

2022-03-30 11:57:41

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v9 bpf-next 1/9] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP

CC some vmalloc folks. What happened was vmalloc huge pages was turned
on for x86 with the rest of the kernel unprepared and problems have
popped up. We are discussing a possible new config and vmap flag such
that for some arch's, huge pages would only be used for certain
allocations such as BPF's.

Thread:

https://lore.kernel.org/lkml/[email protected]/

On Tue, 2022-03-29 at 08:23 +0000, Song Liu wrote:
> > On Mar 28, 2022, at 5:18 PM, Edgecombe, Rick P <
> > [email protected]> wrote:
> >
> > On Mon, 2022-03-28 at 23:27 +0000, Song Liu wrote:
> > [...]
> >
>
>
> How about something like the following? (I still need to fix
> something, but
> would like some feedbacks on the API).
>
> Thanks,
> Song
>
>
> [...]
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 3b1df7da402d..98f8a93fcfd4 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -35,6 +35,11 @@ struct notifier_block; /* in
> notifier.h */
> #define VM_DEFER_KMEMLEAK 0
> #endif
>
> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG
> +#define VM_PREFER_HUGE_VMAP 0x00001000 /* prefer PMD_SIZE
> mapping (bypass vmap_allow_huge check) */

I don't think we should tie this to vmap_allow_huge by definition.
Also, what it does is try 2MB pages for allocations larger than 2MB.
For smaller allocations it doesn't try or "prefer" them.

> +#else
> +#define VM_PREFER_HUGE_VMAP 0
> +#endif
> /* bits [20..32] reserved for arch specific ioremap internals */
>
> [...]
> @@ -3106,7 +3106,8 @@ void *__vmalloc_node_range(unsigned long size,
> unsigned long align,
> return NULL;
> }
>
> - if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) {
> + if ((vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) ||
> + (IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG) &&
> (vm_flags & VM_PREFER_HUGE_VMAP))) {

vmap_allow_huge is how the kernel parameter that can disable vmalloc
huge pages works. I don't think we should separate it. Disabling
vmalloc huge pages should still disable this alternate mode.

> unsigned long size_per_node;
>
> /*

2022-03-30 12:18:39

by Song Liu

[permalink] [raw]
Subject: Re: [PATCH v9 bpf-next 1/9] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP



> On Mar 29, 2022, at 2:36 PM, Edgecombe, Rick P <[email protected]> wrote:
>
> On Tue, 2022-03-29 at 19:13 +0000, Song Liu wrote:
>>>
>>> I don't think we should tie this to vmap_allow_huge by definition.
>>> Also, what it does is try 2MB pages for allocations larger than
>>> 2MB.
>>> For smaller allocations it doesn't try or "prefer" them.
>>
>> How about something like:
>>
>> #define VM_TRY_HUGE_VMAP 0x00001000 /* try PMD_SIZE
>> mapping if size-per-node >= PMD_SIZE */
>
> Seems reasonable name. I don't know if "size-per-node >= PMD_SIZE" is
> going to be useful information. Maybe something like:
>
> /* Allow for huge pages on HAVE_ARCH_HUGE_VMALLOC_FLAG arch's */

Sounds great. I updated the version based on this (below). I will split it
into two patches, I guess.

@Paul, could you please test whether this version fixes the issue you were
seeing?

Thanks,
Song



diff --git a/arch/Kconfig b/arch/Kconfig
index 84bc1de02720..0fb2bd5fd1f8 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -858,6 +858,15 @@ config HAVE_ARCH_HUGE_VMALLOC
depends on HAVE_ARCH_HUGE_VMAP
bool

+#
+# HAVE_ARCH_HUGE_VMALLOC_FLAG allows users of __vmalloc_node_range to allocate
+# huge pages without HAVE_ARCH_HUGE_VMALLOC. To allocate huge pages, the user
+# needs to call __vmalloc_node_range with VM_TRY_HUGE_VMAP.
+#
+config HAVE_ARCH_HUGE_VMALLOC_FLAG
+ depends on HAVE_ARCH_HUGE_VMAP
+ bool
+
config ARCH_WANT_HUGE_PMD_SHARE
bool

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7340d9f01b62..e64f00415575 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -45,7 +45,7 @@ config FORCE_DYNAMIC_FTRACE
in order to test the non static function tracing in the
generic code, as other architectures still use it. But we
only need to keep it around for x86_64. No need to keep it
- for x86_32. For x86_32, force DYNAMIC_FTRACE.
+ for x86_32. For x86_32, force DYNAMIC_FTRACE.
#
# Arch settings
#
@@ -161,7 +161,7 @@ config X86
select HAVE_ALIGNED_STRUCT_PAGE if SLUB
select HAVE_ARCH_AUDITSYSCALL
select HAVE_ARCH_HUGE_VMAP if X86_64 || X86_PAE
- select HAVE_ARCH_HUGE_VMALLOC if X86_64
+ select HAVE_ARCH_HUGE_VMALLOC_FLAG if X86_64
select HAVE_ARCH_JUMP_LABEL
select HAVE_ARCH_JUMP_LABEL_RELATIVE
select HAVE_ARCH_KASAN if X86_64
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 3b1df7da402d..a48d0690b66f 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -35,6 +35,11 @@ struct notifier_block; /* in notifier.h */
#define VM_DEFER_KMEMLEAK 0
#endif

+#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG
+#define VM_TRY_HUGE_VMAP 0x00001000 /* Allow for huge pages on HAVE_ARCH_HUGE_VMALLOC_FLAG arch's */
+#else
+#define VM_TRY_HUGE_VMAP 0
+#endif
/* bits [20..32] reserved for arch specific ioremap internals */

/*
@@ -51,7 +56,7 @@ struct vm_struct {
unsigned long size;
unsigned long flags;
struct page **pages;
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
unsigned int page_order;
#endif
unsigned int nr_pages;
@@ -225,7 +230,7 @@ static inline bool is_vm_area_hugepages(const void *addr)
* prevents that. This only indicates the size of the physical page
* allocated in the vmalloc layer.
*/
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
return find_vm_area(addr)->page_order > 0;
#else
return false;
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 13e9dbeeedf3..5659677b18f3 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -851,13 +851,28 @@ static LIST_HEAD(pack_list);
#define BPF_HPAGE_MASK PAGE_MASK
#endif

+static void *bpf_prog_pack_vmalloc(unsigned long size)
+{
+#if defined(MODULES_VADDR)
+ unsigned long start = MODULES_VADDR;
+ unsigned long end = MODULES_END;
+#else
+ unsigned long start = VMALLOC_START;
+ unsigned long end = VMALLOC_END;
+#endif
+
+ return __vmalloc_node_range(size, PAGE_SIZE, start, end, GFP_KERNEL, PAGE_KERNEL,
+ VM_DEFER_KMEMLEAK | VM_TRY_HUGE_VMAP,
+ NUMA_NO_NODE, __builtin_return_address(0));
+}
+
static size_t select_bpf_prog_pack_size(void)
{
size_t size;
void *ptr;

size = BPF_HPAGE_SIZE * num_online_nodes();
- ptr = module_alloc(size);
+ ptr = bpf_prog_pack_vmalloc(size);

/* Test whether we can get huge pages. If not just use PAGE_SIZE
* packs.
@@ -881,7 +896,7 @@ static struct bpf_prog_pack *alloc_new_pack(void)
GFP_KERNEL);
if (!pack)
return NULL;
- pack->ptr = module_alloc(bpf_prog_pack_size);
+ pack->ptr = bpf_prog_pack_vmalloc(bpf_prog_pack_size);
if (!pack->ptr) {
kfree(pack);
return NULL;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e163372d3967..179200bce285 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -46,7 +46,7 @@
#include "internal.h"
#include "pgalloc-track.h"

-#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
static unsigned int __ro_after_init ioremap_max_page_shift = BITS_PER_LONG - 1;

static int __init set_nohugeiomap(char *str)
@@ -55,11 +55,11 @@ static int __init set_nohugeiomap(char *str)
return 0;
}
early_param("nohugeiomap", set_nohugeiomap);
-#else /* CONFIG_HAVE_ARCH_HUGE_VMAP */
+#else /* CONFIG_HAVE_ARCH_HUGE_VMAP || CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG */
static const unsigned int ioremap_max_page_shift = PAGE_SHIFT;
-#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
+#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP || CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG */

-#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
static bool __ro_after_init vmap_allow_huge = true;

static int __init set_nohugevmalloc(char *str)
@@ -582,8 +582,9 @@ int vmap_pages_range_noflush(unsigned long addr, unsigned long end,

WARN_ON(page_shift < PAGE_SHIFT);

- if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
- page_shift == PAGE_SHIFT)
+ if ((!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) &&
+ !IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG)) ||
+ (page_shift == PAGE_SHIFT))
return vmap_small_pages_range_noflush(addr, end, prot, pages);

for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
@@ -2252,7 +2253,7 @@ static struct vm_struct *vmlist __initdata;

static inline unsigned int vm_area_page_order(struct vm_struct *vm)
{
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
return vm->page_order;
#else
return 0;
@@ -2261,7 +2262,7 @@ static inline unsigned int vm_area_page_order(struct vm_struct *vm)

static inline void set_vm_area_page_order(struct vm_struct *vm, unsigned int order)
{
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
vm->page_order = order;
#else
BUG_ON(order != 0);
@@ -3056,6 +3057,19 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
return NULL;
}

+static bool vmalloc_try_huge_page(unsigned long vm_flags)
+{
+ if (!vmap_allow_huge || (vm_flags & VM_NO_HUGE_VMAP))
+ return false;
+
+ /* HAVE_ARCH_HUGE_VMALLOC arches use huge pages for all large vmallocs */
+ if (IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC))
+ return true;
+
+ /* VM_TRY_HUGE_VMAP is only non-zero on HAVE_ARCH_HUGE_VMALLOC_FLAG arches */
+ return vm_flags & VM_TRY_HUGE_VMAP;
+}
+
/**
* __vmalloc_node_range - allocate virtually contiguous memory
* @size: allocation size
@@ -3106,7 +3116,7 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
return NULL;
}

- if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) {
+ if (vmalloc_try_huge_page(vm_flags)) {
unsigned long size_per_node;

/*

2022-03-31 02:43:06

by Song Liu

[permalink] [raw]
Subject: Re: [PATCH v9 bpf-next 1/9] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP



> On Mar 29, 2022, at 11:39 AM, Edgecombe, Rick P <[email protected]> wrote:
>
> CC some vmalloc folks. What happened was vmalloc huge pages was turned
> on for x86 with the rest of the kernel unprepared and problems have
> popped up. We are discussing a possible new config and vmap flag such
> that for some arch's, huge pages would only be used for certain
> allocations such as BPF's.
>
> Thread:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> On Tue, 2022-03-29 at 08:23 +0000, Song Liu wrote:
>>> On Mar 28, 2022, at 5:18 PM, Edgecombe, Rick P <
>>> [email protected]> wrote:
>>>
>>> On Mon, 2022-03-28 at 23:27 +0000, Song Liu wrote:
>>> [...]
>>
>>
>> How about something like the following? (I still need to fix
>> something, but
>> would like some feedbacks on the API).
>>
>> Thanks,
>> Song
>>
>>
>> [...]
>>
>> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG
>> +#define VM_PREFER_HUGE_VMAP 0x00001000 /* prefer PMD_SIZE
>> mapping (bypass vmap_allow_huge check) */
>
> I don't think we should tie this to vmap_allow_huge by definition.
> Also, what it does is try 2MB pages for allocations larger than 2MB.
> For smaller allocations it doesn't try or "prefer" them.

How about something like:

#define VM_TRY_HUGE_VMAP 0x00001000 /* try PMD_SIZE mapping if size-per-node >= PMD_SIZE */

>
>> +#else
>> +#define VM_PREFER_HUGE_VMAP 0
>> +#endif
>> /* bits [20..32] reserved for arch specific ioremap internals */
>>
>> /*
>> @@ -51,7 +56,7 @@ struct vm_struct {
>> unsigned long size;
>> unsigned long flags;
>> struct page **pages;
>> -#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
>> +#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
>> defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
>> unsigned int page_order;
>> #endif
>> unsigned int nr_pages;
>> @@ -225,7 +230,7 @@ static inline bool is_vm_area_hugepages(const
>> void *addr)
>> * prevents that. This only indicates the size of the physical
>> page
>> * allocated in the vmalloc layer.
>> */
>> -#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
>> +#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
>> defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
>> return find_vm_area(addr)->page_order > 0;
>> #else
>> return false;
>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> index 13e9dbeeedf3..fc9dae095079 100644
>> --- a/kernel/bpf/core.c
>> +++ b/kernel/bpf/core.c
>> @@ -851,13 +851,28 @@ static LIST_HEAD(pack_list);
>> #define BPF_HPAGE_MASK PAGE_MASK
>> #endif
>>
>> +static void *bpf_prog_pack_vmalloc(unsigned long size)
>> +{
>> +#if defined(MODULES_VADDR)
>> + unsigned long start = MODULES_VADDR;
>> + unsigned long end = MODULES_END;
>> +#else
>> + unsigned long start = VMALLOC_START;
>> + unsigned long end = VMALLOC_END;
>> +#endif
>> +
>> + return __vmalloc_node_range(size, PAGE_SIZE, start, end, GFP_KERNEL, PAGE_KERNEL,
>> + VM_DEFER_KMEMLEAK | VM_PREFER_HUGE_VMAP,
>> + NUMA_NO_NODE, __builtin_return_address(0));
>> +}
>> +
>> static size_t select_bpf_prog_pack_size(void)
>> {
>> size_t size;
>> void *ptr;
>>
>> size = BPF_HPAGE_SIZE * num_online_nodes();
>> - ptr = module_alloc(size);
>> + ptr = bpf_prog_pack_vmalloc(size);
>>
>> /* Test whether we can get huge pages. If not just use PAGE_SIZE
>> * packs.
>> @@ -881,7 +896,7 @@ static struct bpf_prog_pack *alloc_new_pack(void)
>> GFP_KERNEL);
>> if (!pack)
>> return NULL;
>> - pack->ptr = module_alloc(bpf_prog_pack_size);
>> + pack->ptr = bpf_prog_pack_vmalloc(bpf_prog_pack_size);
>> if (!pack->ptr) {
>> kfree(pack);
>> return NULL;
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index e163372d3967..9d3c1ab8bdf1 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -2261,7 +2261,7 @@ static inline unsigned int vm_area_page_order(struct vm_struct *vm)
>>
>> static inline void set_vm_area_page_order(struct vm_struct *vm, unsigned int order)
>> {
>> -#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
>> +#if (defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || defined(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG))
>> vm->page_order = order;
>> #else
>> BUG_ON(order != 0);
>> @@ -3106,7 +3106,8 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>> return NULL;
>> }
>>
>> - if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) {
>> + if ((vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) ||
>> + (IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC_FLAG) && (vm_flags & VM_PREFER_HUGE_VMAP))) {
>
> vmap_allow_huge is how the kernel command-line parameter that
> disables vmalloc huge pages takes effect. I don't think this flag
> should bypass it: disabling vmalloc huge pages should still disable
> this alternate mode.

Yeah, this makes sense. I will fix it.
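For reference, the corrected gating could be modeled like this (a userspace sketch, not the final kernel code; the helper name and the VM_NO_HUGE_VMAP value are illustrative). The point is that the vmap_allow_huge kill switch gates both the default huge path and the new opt-in flag path:

```c
#include <assert.h>
#include <stdbool.h>

#define VM_NO_HUGE_VMAP   0x00000800UL  /* illustrative value */
#define VM_TRY_HUGE_VMAP  0x00001000UL  /* illustrative value */

/* Model of the fixed check: if vmap_allow_huge is cleared (huge
 * vmalloc pages disabled on the command line) or the caller passed
 * VM_NO_HUGE_VMAP, neither path may use huge mappings; otherwise
 * either the arch default or the opt-in flag enables them. */
static bool huge_path_allowed(bool vmap_allow_huge, bool huge_by_default,
			      unsigned long vm_flags)
{
	if (!vmap_allow_huge || (vm_flags & VM_NO_HUGE_VMAP))
		return false;
	return huge_by_default || (vm_flags & VM_TRY_HUGE_VMAP);
}
```

With this shape, the opt-in flag can never re-enable huge mappings once the global switch is off.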

>
>> unsigned long size_per_node;
>>
>> /*