2022-04-16 01:28:10

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

On Fri, Apr 15, 2022 at 09:44:09AM -0700, Song Liu wrote:
> Changes v3 => v4:
> 1. Fix __weak module_alloc_huge; remove unused vmalloc_huge; rename
> __vmalloc_huge => vmalloc_huge. (Christoph Hellwig)
> 2. Use vzalloc (as it was before vmalloc_no_huge) and clean up comments in
> kvm_s390_pv_alloc_vm.
>
> Changes v2 => v3:
> 1. Use __vmalloc_huge in alloc_large_system_hash.
> 2. Use EXPORT_SYMBOL_GPL for new functions. (Christoph Hellwig)
> 3. Add more description about the issues and changes.(Christoph Hellwig,
> Rick Edgecombe).
>
> Changes v1 => v2:
> 1. Add vmalloc_huge(). (Christoph Hellwig)
> 2. Add module_alloc_huge(). (Christoph Hellwig)
> 3. Add Fixes tag and Link tag. (Thorsten Leemhuis)
>
> Enabling HAVE_ARCH_HUGE_VMALLOC on x86_64 and use it for bpf_prog_pack has
> caused some issues [1], as many users of vmalloc are not yet ready to
> handle huge pages. To enable a more smooth transition to use huge page
> backed vmalloc memory, this set replaces VM_NO_HUGE_VMAP flag with an new
> opt-in flag, VM_ALLOW_HUGE_VMAP. More discussions about this topic can be
> found at [2].
>
> Patch 1 removes VM_NO_HUGE_VMAP and adds VM_ALLOW_HUGE_VMAP.
> Patch 2 uses VM_ALLOW_HUGE_VMAP in bpf_prog_pack.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/linux-mm/[email protected]/

Looks good except for that I think this should just wait for v5.19. The
fixes are so large I can't see why this needs to be rushed in other than
the first assumptions of the optimizations had some flaws addressed here.

If this goes through v5.19 expect conflicts on modules unless you use
modules-testing.

Luis


2022-04-16 02:50:26

by Song Liu

[permalink] [raw]
Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

On Fri, Apr 15, 2022 at 12:05 PM Luis Chamberlain <[email protected]> wrote:
>
> On Fri, Apr 15, 2022 at 09:44:09AM -0700, Song Liu wrote:
> > Changes v3 => v4:
> > 1. Fix __weak module_alloc_huge; remove unused vmalloc_huge; rename
> > __vmalloc_huge => vmalloc_huge. (Christoph Hellwig)
> > 2. Use vzalloc (as it was before vmalloc_no_huge) and clean up comments in
> > kvm_s390_pv_alloc_vm.
> >
> > Changes v2 => v3:
> > 1. Use __vmalloc_huge in alloc_large_system_hash.
> > 2. Use EXPORT_SYMBOL_GPL for new functions. (Christoph Hellwig)
> > 3. Add more description about the issues and changes.(Christoph Hellwig,
> > Rick Edgecombe).
> >
> > Changes v1 => v2:
> > 1. Add vmalloc_huge(). (Christoph Hellwig)
> > 2. Add module_alloc_huge(). (Christoph Hellwig)
> > 3. Add Fixes tag and Link tag. (Thorsten Leemhuis)
> >
> > Enabling HAVE_ARCH_HUGE_VMALLOC on x86_64 and use it for bpf_prog_pack has
> > caused some issues [1], as many users of vmalloc are not yet ready to
> > handle huge pages. To enable a more smooth transition to use huge page
> > backed vmalloc memory, this set replaces VM_NO_HUGE_VMAP flag with an new
> > opt-in flag, VM_ALLOW_HUGE_VMAP. More discussions about this topic can be
> > found at [2].
> >
> > Patch 1 removes VM_NO_HUGE_VMAP and adds VM_ALLOW_HUGE_VMAP.
> > Patch 2 uses VM_ALLOW_HUGE_VMAP in bpf_prog_pack.
> >
> > [1] https://lore.kernel.org/lkml/[email protected]/
> > [2] https://lore.kernel.org/linux-mm/[email protected]/
>
> Looks good except for that I think this should just wait for v5.19. The
> fixes are so large I can't see why this needs to be rushed in other than
> the first assumptions of the optimizations had some flaws addressed here.

We need these changes to fix issues like [3]. Note that there might
still be some
undiscovered issues with huge page backed vmalloc memory on powerpc, which
had HAVE_ARCH_HUGE_VMALLOC enabled since the 5.15 kernel. As we
agreed, the new opt-in flag is a safer approach here. We probably should have
1/4 and 2/4 back ported to stable. Therefore, I think shipping this
set now would
give us a more reliable 5.18 release.

Does this make sense?

Thanks,
Song

[3] https://lore.kernel.org/lkml/[email protected]/

2022-04-16 02:58:13

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

On Fri, Apr 15, 2022 at 06:34:16PM -0700, Song Liu wrote:
> On Fri, Apr 15, 2022 at 12:05 PM Luis Chamberlain <[email protected]> wrote:
> >
> > On Fri, Apr 15, 2022 at 09:44:09AM -0700, Song Liu wrote:
> > > Changes v3 => v4:
> > > 1. Fix __weak module_alloc_huge; remove unused vmalloc_huge; rename
> > > __vmalloc_huge => vmalloc_huge. (Christoph Hellwig)
> > > 2. Use vzalloc (as it was before vmalloc_no_huge) and clean up comments in
> > > kvm_s390_pv_alloc_vm.
> > >
> > > Changes v2 => v3:
> > > 1. Use __vmalloc_huge in alloc_large_system_hash.
> > > 2. Use EXPORT_SYMBOL_GPL for new functions. (Christoph Hellwig)
> > > 3. Add more description about the issues and changes.(Christoph Hellwig,
> > > Rick Edgecombe).
> > >
> > > Changes v1 => v2:
> > > 1. Add vmalloc_huge(). (Christoph Hellwig)
> > > 2. Add module_alloc_huge(). (Christoph Hellwig)
> > > 3. Add Fixes tag and Link tag. (Thorsten Leemhuis)
> > >
> > > Enabling HAVE_ARCH_HUGE_VMALLOC on x86_64 and use it for bpf_prog_pack has
> > > caused some issues [1], as many users of vmalloc are not yet ready to
> > > handle huge pages. To enable a more smooth transition to use huge page
> > > backed vmalloc memory, this set replaces VM_NO_HUGE_VMAP flag with an new
> > > opt-in flag, VM_ALLOW_HUGE_VMAP. More discussions about this topic can be
> > > found at [2].
> > >
> > > Patch 1 removes VM_NO_HUGE_VMAP and adds VM_ALLOW_HUGE_VMAP.
> > > Patch 2 uses VM_ALLOW_HUGE_VMAP in bpf_prog_pack.
> > >
> > > [1] https://lore.kernel.org/lkml/[email protected]/
> > > [2] https://lore.kernel.org/linux-mm/[email protected]/
> >
> > Looks good except for that I think this should just wait for v5.19. The
> > fixes are so large I can't see why this needs to be rushed in other than
> > the first assumptions of the optimizations had some flaws addressed here.
>
> We need these changes to fix issues like [3]. Note that there might
> still be some
> undiscovered issues with huge page backed vmalloc memory on powerpc, which
> had HAVE_ARCH_HUGE_VMALLOC enabled since the 5.15 kernel. As we
> agreed, the new opt-in flag is a safer approach here. We probably should have
> 1/4 and 2/4 back ported to stable. Therefore, I think shipping this
> set now would
> give us a more reliable 5.18 release.
>
> Does this make sense?

Yes absolutely, but that sounds like that optimization should just be
reverted completely from v5.18 isntead then no? Or if one can skip the
optimizations for v5.18 with a small patch, and then enable for v5.19.

Luis

2022-04-16 02:58:13

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

On Fri, Apr 15, 2022 at 06:42:27PM -0700, Luis Chamberlain wrote:
> On Fri, Apr 15, 2022 at 06:34:16PM -0700, Song Liu wrote:
> > On Fri, Apr 15, 2022 at 12:05 PM Luis Chamberlain <[email protected]> wrote:
> > >
> > > On Fri, Apr 15, 2022 at 09:44:09AM -0700, Song Liu wrote:
> > > > Changes v3 => v4:
> > > > 1. Fix __weak module_alloc_huge; remove unused vmalloc_huge; rename
> > > > __vmalloc_huge => vmalloc_huge. (Christoph Hellwig)
> > > > 2. Use vzalloc (as it was before vmalloc_no_huge) and clean up comments in
> > > > kvm_s390_pv_alloc_vm.
> > > >
> > > > Changes v2 => v3:
> > > > 1. Use __vmalloc_huge in alloc_large_system_hash.
> > > > 2. Use EXPORT_SYMBOL_GPL for new functions. (Christoph Hellwig)
> > > > 3. Add more description about the issues and changes.(Christoph Hellwig,
> > > > Rick Edgecombe).
> > > >
> > > > Changes v1 => v2:
> > > > 1. Add vmalloc_huge(). (Christoph Hellwig)
> > > > 2. Add module_alloc_huge(). (Christoph Hellwig)
> > > > 3. Add Fixes tag and Link tag. (Thorsten Leemhuis)
> > > >
> > > > Enabling HAVE_ARCH_HUGE_VMALLOC on x86_64 and use it for bpf_prog_pack has
> > > > caused some issues [1], as many users of vmalloc are not yet ready to
> > > > handle huge pages. To enable a more smooth transition to use huge page
> > > > backed vmalloc memory, this set replaces VM_NO_HUGE_VMAP flag with an new
> > > > opt-in flag, VM_ALLOW_HUGE_VMAP. More discussions about this topic can be
> > > > found at [2].
> > > >
> > > > Patch 1 removes VM_NO_HUGE_VMAP and adds VM_ALLOW_HUGE_VMAP.
> > > > Patch 2 uses VM_ALLOW_HUGE_VMAP in bpf_prog_pack.
> > > >
> > > > [1] https://lore.kernel.org/lkml/[email protected]/
> > > > [2] https://lore.kernel.org/linux-mm/[email protected]/
> > >
> > > Looks good except for that I think this should just wait for v5.19. The
> > > fixes are so large I can't see why this needs to be rushed in other than
> > > the first assumptions of the optimizations had some flaws addressed here.
> >
> > We need these changes to fix issues like [3]. Note that there might
> > still be some
> > undiscovered issues with huge page backed vmalloc memory on powerpc, which
> > had HAVE_ARCH_HUGE_VMALLOC enabled since the 5.15 kernel. As we
> > agreed, the new opt-in flag is a safer approach here. We probably should have
> > 1/4 and 2/4 back ported to stable. Therefore, I think shipping this
> > set now would
> > give us a more reliable 5.18 release.
> >
> > Does this make sense?
>
> Yes absolutely, but that sounds like that optimization should just be
> reverted completely from v5.18 isntead then no? Or if one can skip the
> optimizations for v5.18 with a small patch, and then enable for v5.19.

In any case it is up to Linus, I already chimed in with my opinion.

Luis

2022-04-16 05:08:55

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

On Fri, Apr 15, 2022 at 12:05:42PM -0700, Luis Chamberlain wrote:
> Looks good except for that I think this should just wait for v5.19. The
> fixes are so large I can't see why this needs to be rushed in other than
> the first assumptions of the optimizations had some flaws addressed here.

Patches 1 and 2 are bug fixes for regressions caused by using huge page
backed vmalloc by default. So I think we do need it for 5.18. The
other two do look like candidates for 5.19, though.

2022-04-18 05:34:17

by Song Liu

[permalink] [raw]
Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP



> On Apr 15, 2022, at 10:08 PM, Christoph Hellwig <[email protected]> wrote:
>
> On Fri, Apr 15, 2022 at 12:05:42PM -0700, Luis Chamberlain wrote:
>> Looks good except for that I think this should just wait for v5.19. The
>> fixes are so large I can't see why this needs to be rushed in other than
>> the first assumptions of the optimizations had some flaws addressed here.
>
> Patches 1 and 2 are bug fixes for regressions caused by using huge page
> backed vmalloc by default. So I think we do need it for 5.18. The
> other two do look like candidates for 5.19, though.

Thanks Luis and Christoph for your kind inputs on the set.

Here are my analysis after thinking about it overnight.

We can discuss the users of vmalloc in 4 categories: module_alloc, BPF
programs, alloc_large_system_hash, and others; and there are two archs
involved here: x86_64 and powerpc.

With whole set, the behavior is like:

| x86_64 | powerpc
--------------------------------------------+----------------------
module_alloc | use small pages
--------------------------------------------+----------------------
BPF programs | use 2MB pages | use small changes
--------------------------------------------+----------------------
large hash | use huge pages when size > PMD_SIZE
--------------------------------------------+----------------------
other-vmalloc | use small pages


Patch 1/4 fixes the behavior of module_alloc and other-vmalloc.
Without 1/4, both these users may get huge pages for size > PMD_SIZE
allocations, which may be troublesome([3] for example).

Patch 3/4 and 4/4, together with 1/1, allows BPF programs use 2MB
pages. This is the same behavior as before 5.18-rc1, which has been
tested in bpf-next and linux-next. Therefore, I don't think we need
to hold them until 5.19.

Patch 2/4 enables huge pages for large hash. Large hash has been
using huge pages on powerpc since 5.15. But this is new for x86_64.
If we ship 2/4, this is a performance improvement for x86_64, but
it is less tested on x86_64 (didn't go through linux-next). If we
ship 1/4 but not 2/4 with 5.18, we will see a small performance
regression for powerpc.

Based on this analysis, I think we should either
1) ship the whole set with 5.18; or
2) ship 1/4, 3/4, and 4/4 with 5.18, and 2/4 with 5.19.

With option 1), we enables huge pages for large hash on x86_64
without going through linux-next. With option 2), we take a small
performance regression with 5.18 on powerpc.

Of course, we can ship a hybrid solution by gating 2/4 for powerpc
only in 5.18, and enabling it for x86_64 in 5.19.

Does this make sense? Please let me know you comments and
suggestions on this.

Thanks,
Song

[3] https://lore.kernel.org/lkml/[email protected]/

2022-04-22 11:38:16

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

Excerpts from Christoph Hellwig's message of April 16, 2022 3:08 pm:
> On Fri, Apr 15, 2022 at 12:05:42PM -0700, Luis Chamberlain wrote:
>> Looks good except for that I think this should just wait for v5.19. The
>> fixes are so large I can't see why this needs to be rushed in other than
>> the first assumptions of the optimizations had some flaws addressed here.
>
> Patches 1 and 2 are bug fixes for regressions caused by using huge page
> backed vmalloc by default. So I think we do need it for 5.18.

No, the huge vmap patch should just be reverted because that caused
the regression, rather than adding another hack on top of it. All the
breakage is in arch/x86/, it doesn't make sense to change this code
and APIs outside x86 to work around it.

And once they are fixed these shouldn't be needed.

Thanks,
Nick

2022-04-25 04:04:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

[ I see that you posted a new version of the series, but I wasn't cc'd
on that one, so I'm replying to the old thread instead ]

On Sat, Apr 16, 2022 at 12:55 PM Song Liu <[email protected]> wrote:
>
> Patch 2/4 enables huge pages for large hash.

I decided that for 5.18, we want to at least fix the performance
regression on powerpc, so I've applied the 2/4 patch to enable huge
pages for the large hashes.

I also enabled them for kvmalloc(), since that seemed like the one
ObviouslySafe(tm) case of vmalloc use (famous last words, maybe I'll
be informed of somebody who still did odd protection games on the
result, but that really sounds invalid with the whole SLUB component).

I'm not touching the bpf parts. I think that's a 5.19 issue by now,
and since it's new, there's no equivalent performance regression
issue.

Linus

2022-04-26 07:04:12

by Song Liu

[permalink] [raw]
Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

Hi Linus,

> On Apr 24, 2022, at 10:43 AM, Linus Torvalds <[email protected]> wrote:
>
> [ I see that you posted a new version of the series, but I wasn't cc'd
> on that one, so I'm replying to the old thread instead ]

Thanks for fixing up these, and sorry for the messing up CC list.

>
> On Sat, Apr 16, 2022 at 12:55 PM Song Liu <[email protected]> wrote:
>>
>> Patch 2/4 enables huge pages for large hash.
>
> I decided that for 5.18, we want to at least fix the performance
> regression on powerpc, so I've applied the 2/4 patch to enable huge
> pages for the large hashes.
>
> I also enabled them for kvmalloc(), since that seemed like the one
> ObviouslySafe(tm) case of vmalloc use (famous last words, maybe I'll
> be informed of somebody who still did odd protection games on the
> result, but that really sounds invalid with the whole SLUB component).
>
> I'm not touching the bpf parts. I think that's a 5.19 issue by now,
> and since it's new, there's no equivalent performance regression
> issue.

With 5.18-rc4, bpf programs on x86_64 are using 4kB bpf_prog_pack.
So multiple small BPF programs will share a 4kB page. We still need
to initialize each 4kB bpf_prog_pack with illegal instructions, with
something similar to [1]. I will revise it based on Peter’s
suggestion [2].

Thanks,
Song



[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/