2022-04-19 11:15:21

by Edgecombe, Rick P

Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

On Mon, 2022-04-18 at 17:44 -0700, Luis Chamberlain wrote:
> > There are use-cases that require 4K pages with non-default
> > permissions in the direct map, and the pages do not necessarily
> > need to be executable. There were several suggestions to implement
> > caches of 4K pages backed by 2M pages.
>
> Even if we just focus on the executable side of the story... there
> may be users who can share this too.
>
> I've gone down memory lane now, at least down to year 2005 in
> kprobes, to see why the heck module_alloc() was used. At first
> glance there are some old comments about being within the 2 GiB
> kernel text range... but some old tribal knowledge is still lost.
> The real hints come from kprobe work since commit 9ec4b1f356b3
> ("[PATCH] kprobes: fix single-step out of line - take2"), namely
> that "For the %rip-relative displacement fixups to be doable"... but
> this got me wondering: would other users who *do* want similar
> functionality benefit from a cache? If the space is limited then
> using a cache makes sense. Especially if architectures tend to
> require hacks for some of this to all work.

Yea, that was my understanding. X86 modules have to be linked within
2GB of the kernel text, also eBPF x86 JIT generates code that expects
to be within 2GB of the kernel text.


I think of two types of caches we could have: caches of unmapped pages
on the direct map and caches of virtual memory mappings. Caches of
pages on the direct map reduce breakage of the large pages (a somewhat
x86-specific problem). Caches of virtual memory mappings reduce
shootdowns, and are also required to share huge pages. I'll plug
my old RFC, where I tried to work towards enabling both:

https://lore.kernel.org/lkml/[email protected]/

Since then Mike has taken the direct map cache piece a lot further.

Yea, probably a lot of JITs are way smaller than a page, but there is
also hopefully some performance benefit of reduced ITLB pressure and
TLB shootdowns. I think kprobes/ftrace (or at least one of them) keeps
its own cache of a page for putting very small trampolines.

>
> Then, since it seems the vmalloc area was not initialized, wouldn't
> that break the old JIT spray fixes? Refer to commit 314beb9bcabfd
> ("x86: bpf_jit_comp: secure bpf jit against spraying attacks").

Hmm, yea it might be a way to get around the ebpf jit rlimit. The
allocator could just text_poke() invalid instructions on "free" of the
jit.

>
> Is that sort of work not needed anymore? If in doubt I at least made
> the old proof of concept JIT spray stuff compile on recent kernels
> [0], but I haven't tried out your patches yet. If this is not needed
> anymore, why not?

IIRC this got addressed in two ways: randomizing of the JIT offset
inside the vmalloc allocation, and "constant blinding", such that the
specific attack of inserting unaligned instructions as immediate
instruction data did not work. Neither of those mitigations seems
unworkable with a large page caching allocator.

>
> The collection of tribal knowledge around these sorts of things
> would be good to not lose, and if we can share, even better.

Totally agree here. I think the abstraction I was exploring in that RFC
could remove some of the special permission memory tribal knowledge
that is lurking in the cross-arch module.c. I wonder if you have any
thoughts on something like that? The normal modules proved the hardest.


2022-04-21 07:36:08

by Song Liu

Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

Hi Mike, Luis, and Rick,

Thanks for sharing your work and findings in the space. I didn't
realize we were looking at the same set of problems.

> On Apr 18, 2022, at 6:56 PM, Edgecombe, Rick P <[email protected]> wrote:
>
> On Mon, 2022-04-18 at 17:44 -0700, Luis Chamberlain wrote:
>>> There are use-cases that require 4K pages with non-default
>>> permissions in the direct map, and the pages do not necessarily
>>> need to be executable. There were several suggestions to implement
>>> caches of 4K pages backed by 2M pages.
>>
>> Even if we just focus on the executable side of the story... there
>> may be users who can share this too.
>>
>> I've gone down memory lane now, at least down to year 2005 in
>> kprobes, to see why the heck module_alloc() was used. At first
>> glance there are some old comments about being within the 2 GiB
>> kernel text range... but some old tribal knowledge is still lost.
>> The real hints come from kprobe work since commit 9ec4b1f356b3
>> ("[PATCH] kprobes: fix single-step out of line - take2"), namely
>> that "For the %rip-relative displacement fixups to be doable"... but
>> this got me wondering: would other users who *do* want similar
>> functionality benefit from a cache? If the space is limited then
>> using a cache makes sense. Especially if architectures tend to
>> require hacks for some of this to all work.
>
> Yea, that was my understanding. X86 modules have to be linked within
> 2GB of the kernel text, also eBPF x86 JIT generates code that expects
> to be within 2GB of the kernel text.
>
>
> I think of two types of caches we could have: caches of unmapped pages
> on the direct map and caches of virtual memory mappings. Caches of
> pages on the direct map reduce breakage of the large pages (a somewhat
> x86-specific problem). Caches of virtual memory mappings reduce
> shootdowns, and are also required to share huge pages. I'll plug
> my old RFC, where I tried to work towards enabling both:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> Since then Mike has taken the direct map cache piece a lot further.

This is really interesting work. With this landed, we won't need
the bpf_prog_pack work at all (I think). OTOH, this looks like a
long-term project, as some of the work in bpf_prog_pack took quite
some time to discuss/debate, and that was just a subset of the
whole thing.

I really like the two types of cache concept. But there are some
details I cannot figure out about them:

1. Is "caches of unmapped pages on the direct map" (cache #1)
sufficient to fix all direct map fragmentation? IIUC, pages in
the cache may still be used by other allocations (under some
memory pressure). If the system runs for long enough, there
may be a lot of direct map fragmentation. Is this right?
2. If we have "cache of virtual memory mappings" (cache #2), do we
still need cache #1? I know cache #2 alone may waste some
memory, but I still think 2MB is within noise for modern systems.
3. If we do need both caches, what would be the right APIs?

Thanks,
Song



> Yea, probably a lot of JITs are way smaller than a page, but there is
> also hopefully some performance benefit of reduced ITLB pressure and
> TLB shootdowns. I think kprobes/ftrace (or at least one of them) keeps
> its own cache of a page for putting very small trampolines.
>
>>
>> Then, since it seems the vmalloc area was not initialized, wouldn't
>> that break the old JIT spray fixes? Refer to commit 314beb9bcabfd
>> ("x86: bpf_jit_comp: secure bpf jit against spraying attacks").
>
> Hmm, yea it might be a way to get around the ebpf jit rlimit. The
> allocator could just text_poke() invalid instructions on "free" of the
> jit.
>
>>
>> Is that sort of work not needed anymore? If in doubt I at least made
>> the old proof of concept JIT spray stuff compile on recent kernels
>> [0], but I haven't tried out your patches yet. If this is not needed
>> anymore, why not?
>
> IIRC this got addressed in two ways: randomizing of the JIT offset
> inside the vmalloc allocation, and "constant blinding", such that the
> specific attack of inserting unaligned instructions as immediate
> instruction data did not work. Neither of those mitigations seems
> unworkable with a large page caching allocator.
>
>>
>> The collection of tribal knowledge around these sorts of things
>> would be good to not lose, and if we can share, even better.
>
> Totally agree here. I think the abstraction I was exploring in that RFC
> could remove some of the special permission memory tribal knowledge
> that is lurking in the cross-arch module.c. I wonder if you have any
> thoughts on something like that? The normal modules proved the hardest.
>

2022-04-22 05:55:16

by Luis Chamberlain

Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

On Tue, Apr 19, 2022 at 01:56:03AM +0000, Edgecombe, Rick P wrote:
> Yea, that was my understanding. X86 modules have to be linked within
> 2GB of the kernel text, also eBPF x86 JIT generates code that expects
> to be within 2GB of the kernel text.

And kprobes / live patching / ftrace.

Another architectural fun fact: powerpc book3s/32 requires executability
to be set per 256 MB segment. Some architectures like this one will want
to also optimize how they use the module alloc area.

Even though today the use cases might be limited, we don't exactly know
how much memory a target device has as well, and so treating memory
failures for "special memory" requests as regular memory failures seems
a bit odd, and users could get confused. For instance, slapping extra
memory on a system won't resolve any issues if the limit for a special
type of memory is already hit. Very likely not a problem at all today,
given how small modules / eBPF jit programs / etc are, but conceptually
it would seem wrong to just say -ENOMEM when in fact it's a special type
of required memory which cannot be allocated and the issue cannot
possibly be fixed. I don't think we have an option but to use -ENOMEM,
but at least hinting at the special failure would seem desirable.

Do we have other types of architectural limitations for "special memory"
other than executable? Do we have *new* types of special memory we
should consider which might be similar / limited in nature? And can /
could / should these architectural limitations hopefully disappear in
newer CPUs? I see vmalloc_pks() as you pointed out [0]. Anything else?

> I think of two types of caches we could have: caches of unmapped pages
> on the direct map and caches of virtual memory mappings. Caches of
> pages on the direct map reduce breakage of the large pages (a somewhat
> x86-specific problem). Caches of virtual memory mappings reduce
> shootdowns, and are also required to share huge pages. I'll plug
> my old RFC, where I tried to work towards enabling both:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> Since then Mike has taken the direct map cache piece a lot further.
>
> Yea, probably a lot of JITs are way smaller than a page, but there is
> also hopefully some performance benefit of reduced ITLB pressure and
> TLB shootdowns. I think kprobes/ftrace (or at least one of them) keeps
> its own cache of a page for putting very small trampolines.

The reason I looked into *why* module_alloc() was used was particularly
because it seemed a bit odd to have such ITLB enhancements for such
a niche use case and we couldn't have desired this elsewhere before.

> > Then, since it seems the vmalloc area was not initialized, wouldn't
> > that break the old JIT spray fixes? Refer to commit 314beb9bcabfd
> > ("x86: bpf_jit_comp: secure bpf jit against spraying attacks").
>
> Hmm, yea it might be a way to get around the ebpf jit rlimit. The
> allocator could just text_poke() invalid instructions on "free" of the
> jit.
>
> >
> > Is that sort of work not needed anymore? If in doubt I at least made
> > the old proof of concept JIT spray stuff compile on recent kernels
> > [0], but I haven't tried out your patches yet. If this is not needed
> > anymore, why not?
>
> IIRC this got addressed in two ways: randomizing of the JIT offset
> inside the vmalloc allocation, and "constant blinding", such that the
> specific attack of inserting unaligned instructions as immediate
> instruction data did not work. Neither of those mitigations seems
> unworkable with a large page caching allocator.

Got it, but was it *also* considered in the fixes posted recently?

> > The collection of tribal knowledge around these sorts of things
> > would be good to not lose, and if we can share, even better.
>
> Totally agree here. I think the abstraction I was exploring in that RFC
> could remove some of the special permission memory tribal knowledge
> that is lurking in the cross-arch module.c. I wonder if you have any
> thoughts on something like that? The normal modules proved the hardest.

Yeah modules will be harder now with the new ARCH_WANTS_MODULES_DATA_IN_VMALLOC
which Christophe Leroy added (queued in my modules-next). At a quick
glance it seems like an API in the right direction, but you just need
more architecture folks other than the usual x86 suspects to review.

Perhaps time for a new spin?

[0] https://lore.kernel.org/lkml/[email protected]/

Luis

2022-04-22 20:49:37

by Mike Rapoport

Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

Hi,

On Tue, Apr 19, 2022 at 05:36:45AM +0000, Song Liu wrote:
> Hi Mike, Luis, and Rick,
>
> Thanks for sharing your work and findings in the space. I didn't
> realize we were looking at the same set of problems.
>
> > On Apr 18, 2022, at 6:56 PM, Edgecombe, Rick P <[email protected]> wrote:
> >
> > On Mon, 2022-04-18 at 17:44 -0700, Luis Chamberlain wrote:
> >>> There are use-cases that require 4K pages with non-default
> >>> permissions in the direct map, and the pages do not necessarily
> >>> need to be executable. There were several suggestions to implement
> >>> caches of 4K pages backed by 2M pages.
> >>
> >> Even if we just focus on the executable side of the story... there
> >> may be users who can share this too.
> >>
> >> I've gone down memory lane now, at least down to year 2005 in
> >> kprobes, to see why the heck module_alloc() was used. At first
> >> glance there are some old comments about being within the 2 GiB
> >> kernel text range... but some old tribal knowledge is still lost.
> >> The real hints come from kprobe work since commit 9ec4b1f356b3
> >> ("[PATCH] kprobes: fix single-step out of line - take2"), namely
> >> that "For the %rip-relative displacement fixups to be doable"... but
> >> this got me wondering: would other users who *do* want similar
> >> functionality benefit from a cache? If the space is limited then
> >> using a cache makes sense. Especially if architectures tend to
> >> require hacks for some of this to all work.
> >
> > Yea, that was my understanding. X86 modules have to be linked within
> > 2GB of the kernel text, also eBPF x86 JIT generates code that expects
> > to be within 2GB of the kernel text.
> >
> >
> > I think of two types of caches we could have: caches of unmapped pages
> > on the direct map and caches of virtual memory mappings. Caches of
> > pages on the direct map reduce breakage of the large pages (a somewhat
> > x86-specific problem). Caches of virtual memory mappings reduce
> > shootdowns, and are also required to share huge pages. I'll plug my
> > old RFC, where I tried to work towards enabling both:
> >
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > Since then Mike has taken the direct map cache piece a lot further.
>
> This is really interesting work. With this landed, we won't need
> the bpf_prog_pack work at all (I think). OTOH, this looks like a
> long-term project, as some of the work in bpf_prog_pack took quite
> some time to discuss/debate, and that was just a subset of the
> whole thing.

I'd say that bpf_prog_pack was a cure for the symptoms, while this
project tries to address the more general problem.
But you are right, it'll take some time and won't land in 5.19.

> I really like the two types of cache concept. But there are some
> details I cannot figure out about them:

After some discussions we decided to try moving the caching of large pages
to the page allocator and see if the second cache will be needed at all.
But I've got distracted after posting the RFC and that work didn't have
real progress since then.

> 1. Is "caches of unmapped pages on the direct map" (cache #1)
> sufficient to fix all direct map fragmentation? IIUC, pages in
> the cache may still be used by other allocations (under some
> memory pressure). If the system runs for long enough, there
> may be a lot of direct map fragmentation. Is this right?

If the system runs long enough, it may run out of high-order free pages
regardless of the way the caches are implemented. Then we either fail
the allocation because it is impossible to refill the cache with large
pages, or fall back to 4k pages and fragment the direct map.

I don't see how we can avoid direct map fragmentation entirely and
still be able to allocate memory for users of the set_memory APIs.

> 2. If we have "cache of virtual memory mappings" (cache #2), do we
> still need cache #1? I know cache #2 alone may waste some
> memory, but I still think 2MB is within noise for modern systems.

I presume that by cache #1 you mean the cache in the page allocator. In
that case cache #2 is probably not needed at all, because the cache at
the page allocator level will be used by vmalloc() and friends to
provide what Rick called "permissioned allocations".

> Thanks,
> Song

--
Sincerely yours,
Mike.

2022-04-22 22:16:51

by Edgecombe, Rick P

Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

On Tue, 2022-04-19 at 14:24 -0700, Luis Chamberlain wrote:
> On Tue, Apr 19, 2022 at 01:56:03AM +0000, Edgecombe, Rick P wrote:
> > Yea, that was my understanding. X86 modules have to be linked
> > within 2GB of the kernel text, also eBPF x86 JIT generates code
> > that expects to be within 2GB of the kernel text.
>
> And kprobes / live patching / ftrace.
>
> Another architectural fun fact: powerpc book3s/32 requires
> executability to be set per 256 MB segment. Some architectures like
> this one will want to also optimize how they use the module alloc
> area.
>
> Even though today the use cases might be limited, we don't exactly
> know how much memory a target device has as well, and so treating
> memory failures for "special memory" requests as regular memory
> failures seems a bit odd, and users could get confused. For
> instance, slapping extra memory on a system won't resolve any issues
> if the limit for a special type of memory is already hit. Very
> likely not a problem at all today, given how small modules / eBPF
> jit programs / etc are, but conceptually it would seem wrong to just
> say -ENOMEM when in fact it's a special type of required memory
> which cannot be allocated and the issue cannot possibly be fixed. I
> don't think we have an option but to use -ENOMEM, but at least
> hinting at the special failure would seem desirable.

ENOMEM doesn't always mean out of physical memory though right? Could
be hitting some other memory limit. Not sure where this discussion of
the error code is coming from though.

As far as the problem of eating a whole 2MB on small systems, this
makes sense to me to worry about. Errata limit what we can do here with
swapping page sizes on the fly. Probably a sensible solution would be
to decide whether to try based on system properties like boot memory
size.

Even without 2MB pages though, there are still improvements from these
types of changes.

>
> Do we have other types of architectural limitations for "special
> memory" other than executable? Do we have *new* types of special
> memory we should consider which might be similar / limited in nature?
> And can / could / should these architectural limitations hopefully
> disappear in newer CPUs? I see vmalloc_pks() as you pointed out [0].
> Anything else?

Hmm, shadow stack permission memory could pop up in vmalloc someday.

Not sure what you mean by architectural limitations. The relative
addressing? If so, those other usages shouldn't be restricted by that.

>
> > I think of two types of caches we could have: caches of unmapped
> > pages on the direct map and caches of virtual memory mappings.
> > Caches of pages on the direct map reduce breakage of the large
> > pages (a somewhat x86-specific problem). Caches of virtual memory
> > mappings reduce shootdowns, and are also required to share huge
> > pages. I'll plug my old RFC, where I tried to work towards
> > enabling both:
> >
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > Since then Mike has taken the direct map cache piece a lot
> > further.
> >
> > Yea, probably a lot of JITs are way smaller than a page, but there
> > is also hopefully some performance benefit of reduced ITLB
> > pressure and TLB shootdowns. I think kprobes/ftrace (or at least
> > one of them) keeps its own cache of a page for putting very small
> > trampolines.
>
> The reason I looked into *why* module_alloc() was used was
> particularly because it seemed a bit odd to have such ITLB
> enhancements for such a niche use case and we couldn't have desired
> this elsewhere before.

I think in general it is the only cross-arch way to get an allocation
in the arch's executable compatible area (which some need as you say).
module_alloc() is probably misnamed at this point with so many users
that are not modules.

>
> > > Then, since it seems the vmalloc area was not initialized,
> > > wouldn't that break the old JIT spray fixes? Refer to commit
> > > 314beb9bcabfd ("x86: bpf_jit_comp: secure bpf jit against
> > > spraying attacks").
> >
> > Hmm, yea it might be a way to get around the ebpf jit rlimit. The
> > allocator could just text_poke() invalid instructions on "free" of
> > the jit.
> >
> > > Is that sort of work not needed anymore? If in doubt I at least
> > > made the old proof of concept JIT spray stuff compile on recent
> > > kernels [0], but I haven't tried out your patches yet. If this
> > > is not needed anymore, why not?
> >
> > IIRC this got addressed in two ways: randomizing of the JIT offset
> > inside the vmalloc allocation, and "constant blinding", such that
> > the specific attack of inserting unaligned instructions as
> > immediate instruction data did not work. Neither of those
> > mitigations seems unworkable with a large page caching allocator.
>
> Got it, but was it *also* considered in the fixes posted recently?

I didn't read any discussion about it. But if it wasn't clear, I'm just
an onlooker on bpf_prog_pack. I didn't see it until it was already
upstream. Maybe Song can say.

>
> > > The collection of tribal knowledge around these sorts of things
> > > would be good to not lose, and if we can share, even better.
> >
> > Totally agree here. I think the abstraction I was exploring in that
> > RFC could remove some of the special permission memory tribal
> > knowledge that is lurking in the cross-arch module.c. I wonder if
> > you have any thoughts on something like that? The normal modules
> > proved the hardest.
>
> Yeah modules will be harder now with the new
> ARCH_WANTS_MODULES_DATA_IN_VMALLOC which Christophe Leroy added
> (queued in my modules-next).

Part of that work was separating out each module into 4 allocations, so
it might make it easier.

> At a quick
> glance it seems like an API in the right direction, but you just need
> more architecture folks other than the usual x86 suspects to review.
>
> Perhaps time for a new spin?

I would very much like to, but I am currently too busy with another
project. As such, I am mostly just trying to contribute ideas and my
personal collection of the hidden knowledge. It sounded like from Song,
someone might want to tackle it before I can get back to it.

And, yes, other arches' review would be critical to making sure it
actually is a better interface and not just another one.

>
> [0] https://lore.kernel.org/lkml/[email protected]/
>
> Luis