2022-04-17 01:20:43

by Song Liu

[permalink] [raw]
Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

Hi Linus,

Thanks a lot for your kind reply.

> On Apr 16, 2022, at 1:30 PM, Linus Torvalds <[email protected]> wrote:
>
> On Sat, Apr 16, 2022 at 12:55 PM Song Liu <[email protected]> wrote:
>>
>> Based on this analysis, I think we should either
>> 1) ship the whole set with 5.18; or
>> 2) ship 1/4, 3/4, and 4/4 with 5.18, and 2/4 with 5.19.
>
> Honestly, I think the proper thing to do is
>
> - apply #1, because yes, that "use huge pages" should be an opt-in.
>
> - but just disable hugepages for now.

Hmm.. This sure is an option...

>
> I think those games with set_memory_nx() and friends just show how
> rough this all is right now.
>
> In fact, I personally think that the whole bpf 'prog_pack' stuff
> should probably be disabled. It looks incredible broken to me right
> now.
>
> Unless I mis-read it, it does a "module_alloc()" to allocate the vmap
> area, and then just marks it executable without having even
> initialized the pages. Am I missing something? So now we have random
> kernel memory that is marked executable.
>
> Sure, it's also marked RO, but who cares? It's random data that is now
> executable.

While this is a serious issue (my fault), it is relatively easy to fix.
We can fill the whole space with invalid instructions when the page
is allocated and when a BPF program is freed. Both these operations are
on the slow paths, so the overhead is minimal.

>
> Maybe I am missing something, but I really don't think this is ready
> for prime-time. We should effectively disable it all, and have people
> think through it a lot more.

This has been discussed on lwn.net: https://lwn.net/Articles/883454/.
AFAICT, the biggest concern is whether reserving minimal 2MB for BPF
programs is a good trade-off for memory usage. This is again my fault
not to state the motivation clearly: the primary gain comes from less
page table fragmentation and thus better iTLB efficiency.

Other folks (in recent thread on this topic and offline in other
discussions) also showed strong interests in using similar technical
for text of kernel modules. So I would really like to learn your
opinion on this. There are many details we can optimize, but I guess
the general mechanism has to be something like:
- allocate a huge page, make it safe, and set it as executable;
- as users (BPF, kernel module, etc.) request memory for text, give
a chunk of the huge page to the user.
- use some mechanism to update the chunk of memory safely.

As you correctly noted, I missed the "make it safe" step (which
should be easy to fix). But the rest of the flow is more or less the
best solution to me.

Did I totally miss some better option here? Or is this really not a
viable option (IOW, bpf programs and kernel modules have to use
regular pages)?

Thanks again,
Song


2022-04-18 16:30:38

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

Hi,

On Sat, Apr 16, 2022 at 10:26:08PM +0000, Song Liu wrote:
> > On Apr 16, 2022, at 1:30 PM, Linus Torvalds <[email protected]> wrote:
> >
> > Maybe I am missing something, but I really don't think this is ready
> > for prime-time. We should effectively disable it all, and have people
> > think through it a lot more.
>
> This has been discussed on lwn.net: https://lwn.net/Articles/883454/.
> AFAICT, the biggest concern is whether reserving minimal 2MB for BPF
> programs is a good trade-off for memory usage. This is again my fault
> not to state the motivation clearly: the primary gain comes from less
> page table fragmentation and thus better iTLB efficiency.

Reserving 2MB pages for BPF programs will indeed reduce the fragmentation,
but OTOH it will reduce memory utilization. If for large systems this may
not be an issue, on smaller machines trading off memory for iTLB
performance may be not that obvious.

> Other folks (in recent thread on this topic and offline in other
> discussions) also showed strong interests in using similar technical
> for text of kernel modules. So I would really like to learn your
> opinion on this. There are many details we can optimize, but I guess
> the general mechanism has to be something like:
> - allocate a huge page, make it safe, and set it as executable;
> - as users (BPF, kernel module, etc.) request memory for text, give
> a chunk of the huge page to the user.
> - use some mechanism to update the chunk of memory safely.

There are use-cases that require 4K pages with non-default permissions in
the direct map and the pages not necessarily should be executable. There
were several suggestions to implement caches of 4K pages backed by 2M
pages.

I believe that "allocate huge page and split it to basic pages to hand out
to users" concept should be implemented at page allocator level and I
posted and RFC for this a while ago:

https://lore.kernel.org/all/[email protected]/

--
Sincerely yours,
Mike.

2022-04-21 03:58:05

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

On Mon, Apr 18, 2022 at 01:06:36PM +0300, Mike Rapoport wrote:
> Hi,
>
> On Sat, Apr 16, 2022 at 10:26:08PM +0000, Song Liu wrote:
> > > On Apr 16, 2022, at 1:30 PM, Linus Torvalds <[email protected]> wrote:
> > >
> > > Maybe I am missing something, but I really don't think this is ready
> > > for prime-time. We should effectively disable it all, and have people
> > > think through it a lot more.
> >
> > This has been discussed on lwn.net: https://lwn.net/Articles/883454/.
> > AFAICT, the biggest concern is whether reserving minimal 2MB for BPF
> > programs is a good trade-off for memory usage. This is again my fault
> > not to state the motivation clearly: the primary gain comes from less
> > page table fragmentation and thus better iTLB efficiency.
>
> Reserving 2MB pages for BPF programs will indeed reduce the fragmentation,
> but OTOH it will reduce memory utilization. If for large systems this may
> not be an issue, on smaller machines trading off memory for iTLB
> performance may be not that obvious.

So the current optimization at best should be a kconfig option?

> > Other folks (in recent thread on this topic and offline in other
> > discussions) also showed strong interests in using similar technical
> > for text of kernel modules. So I would really like to learn your
> > opinion on this. There are many details we can optimize, but I guess
> > the general mechanism has to be something like:
> > - allocate a huge page, make it safe, and set it as executable;
> > - as users (BPF, kernel module, etc.) request memory for text, give
> > a chunk of the huge page to the user.
> > - use some mechanism to update the chunk of memory safely.
>
> There are use-cases that require 4K pages with non-default permissions in
> the direct map and the pages not necessarily should be executable. There
> were several suggestions to implement caches of 4K pages backed by 2M
> pages.

Even if we just focus on the executable side of the story... there may
be users who can share this too.

I've gone down memory lane now at least down to year 2005 in kprobes
to see why the heck module_alloc() was used. At first glance there are
some old comments about being within the 2 GiB text kernel range... But
some old tribal knowledge is still lost. The real hints come from kprobe work
since commit 9ec4b1f356b3 ("[PATCH] kprobes: fix single-step out of line
- take2"), so that the "For the %rip-relative displacement fixups to be
doable"... but this got me wondering, would other users who *do* want
similar funcionality benefit from a cache. If the space is limited then
using a cache makes sense. Specially if architectures tend to require
hacks for some of this to all work.

Then, since it seems since the vmalloc area was not initialized,
wouldn't that break the old JIT spray fixes, refer to commit
314beb9bcabfd ("x86: bpf_jit_comp: secure bpf jit against spraying
attacks")?

Is that sort of work not needed anymore? If in doubt I at least made the
old proof of concept JIT spray stuff compile on recent kernels [0], but
I haven't tried out your patches yet. If this is not needed anymore,
why not?

The collection of tribal knowedge around these sorts of things would be
good to not loose and if we can share, even better.

> I believe that "allocate huge page and split it to basic pages to hand out
> to users" concept should be implemented at page allocator level and I
> posted and RFC for this a while ago:
>
> https://lore.kernel.org/all/[email protected]/

Neat, so although eBPF is a big user, are there some use cases outside
that immediately benefit?

[0] https://github.com/mcgrof/jit-spray-poc-for-ksp

LUis

2022-04-22 08:55:51

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

On Mon, Apr 18, 2022 at 05:44:19PM -0700, Luis Chamberlain wrote:
> On Mon, Apr 18, 2022 at 01:06:36PM +0300, Mike Rapoport wrote:
> > Hi,
> >
> > On Sat, Apr 16, 2022 at 10:26:08PM +0000, Song Liu wrote:
> > > > On Apr 16, 2022, at 1:30 PM, Linus Torvalds <[email protected]> wrote:
> > > >
> > > > Maybe I am missing something, but I really don't think this is ready
> > > > for prime-time. We should effectively disable it all, and have people
> > > > think through it a lot more.
> > >
> > > This has been discussed on lwn.net: https://lwn.net/Articles/883454/.
> > > AFAICT, the biggest concern is whether reserving minimal 2MB for BPF
> > > programs is a good trade-off for memory usage. This is again my fault
> > > not to state the motivation clearly: the primary gain comes from less
> > > page table fragmentation and thus better iTLB efficiency.
> >
> > Reserving 2MB pages for BPF programs will indeed reduce the fragmentation,
> > but OTOH it will reduce memory utilization. If for large systems this may
> > not be an issue, on smaller machines trading off memory for iTLB
> > performance may be not that obvious.
>
> So the current optimization at best should be a kconfig option?

Maybe not and it'll be fine on smaller systems, but from what I see the
bpf_prog_pack implementation didn't consider them.

And if we move the caches from BPF to vmalloc or page allocator that
would be much less of an issue.

> > I believe that "allocate huge page and split it to basic pages to hand out
> > to users" concept should be implemented at page allocator level and I
> > posted and RFC for this a while ago:
> >
> > https://lore.kernel.org/all/[email protected]/
>
> Neat, so although eBPF is a big user, are there some use cases outside
> that immediately benefit?

Anything that uses set_memory APIs could benefit from this. Except eBPF and
other module_alloc() users, there is secretmem that also fractures the
direct map and actually that was my initial use case for these patches.

Another possible use-case can be protection of page tables with PKS:

https://lore.kernel.org/lkml/[email protected]/

Vlastimil also mentioned that SEV-SNP could use such caching mechanism, but
I don't know the details.

> LUis

--
Sincerely yours,
Mike.