2022-08-22 17:05:39

by Song Liu

[permalink] [raw]
Subject: Re: [RFC 0/5] vmalloc_exec for modules and BPF programs



> On Aug 18, 2022, at 3:42 PM, Song Liu <[email protected]> wrote:
>
> This set is a prototype that allows dynamic kernel text (modules, bpf
> programs, various trampolines, etc.) to share huge pages. The idea is
> similar to Peter's suggestion in [1]. Please refer to each patch for
> more detais.
>
> The ultimate goal is to only host kernel text in 2MB pages (for x86_64).
>
> Please share your comments on this.
>
> Thanks!
>
> [1] https://lore.kernel.org/bpf/[email protected]/

Hi Peter,

Could you please share your feedback on this?

Thanks,
Song

PS: I guess vger dropped my patch again. :( The set is also available at

https://git.kernel.org/pub/scm/linux/kernel/git/song/linux.git

branch vmalloc_exec.

>
> Song Liu (5):
> vmalloc: introduce vmalloc_exec and vfree_exec
> bpf: use vmalloc_exec
> modules, x86: use vmalloc_exec for module core
> vmalloc_exec: share a huge page with kernel text
> vmalloc: vfree_exec: free unused vm_struct
>
> arch/x86/Kconfig | 1 +
> arch/x86/kernel/alternative.c | 30 ++++-
> arch/x86/kernel/module.c | 1 +
> arch/x86/mm/init_64.c | 3 +-
> include/linux/vmalloc.h | 16 +--
> kernel/bpf/core.c | 155 ++------------------------
> kernel/module/main.c | 23 ++--
> kernel/module/strict_rwx.c | 3 -
> kernel/trace/ftrace.c | 3 +-
> mm/nommu.c | 7 ++
> mm/vmalloc.c | 200 +++++++++++++++++++++++++++++-----
> 11 files changed, 239 insertions(+), 203 deletions(-)
>
> --
> 2.30.2


2022-08-22 17:07:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 0/5] vmalloc_exec for modules and BPF programs

On Mon, Aug 22, 2022 at 03:46:38PM +0000, Song Liu wrote:
> Could you please share your feedback on this?

I've looked at it all of 5 minutes, so perhaps I've missed something.

However, I'm a little surprised you went with a second tree instead of
doing the top-down thing for data. The way you did it makes it hard to
have guard pages between text and data.

2022-08-22 17:09:33

by Song Liu

[permalink] [raw]
Subject: Re: [RFC 0/5] vmalloc_exec for modules and BPF programs



> On Aug 22, 2022, at 9:34 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Aug 22, 2022 at 03:46:38PM +0000, Song Liu wrote:
>> Could you please share your feedback on this?
>
> I've looked at it all of 5 minutes, so perhaps I've missed something.
>
> However, I'm a little surprised you went with a second tree instead of
> doing the top-down thing for data. The way you did it makes it hard to
> have guard pages between text and data.

I didn't realize the importance of the guard pages. But it is not too
hard to do it with this approach. For each 2MB text page, we can reserve
4kB on the beginning and end of it. Would this work?

There are a couple benefits from a second tree:

1. It allows text allocations to go below PAGE_SIZE granularity, while
data allocations would still use PAGE_SIZE granularity, which is the
same as current code.
2. Text allocate requires mapping one vm_struct to many vmap_area. Putting
text allocations in a separate tree make it easier to handle this.
(Well, I haven't finished this logic yet).
3. A separate tree makes it easier to use text tail page,
[_etext, roundup(_etext, PMD_SIZE)], for modules and BPF programs.

Does this make sense? Do you see other downsides with a second tree?

Thanks,
Song

2022-08-23 06:15:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 0/5] vmalloc_exec for modules and BPF programs

On Mon, Aug 22, 2022 at 04:56:47PM +0000, Song Liu wrote:
>
>
> > On Aug 22, 2022, at 9:34 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > On Mon, Aug 22, 2022 at 03:46:38PM +0000, Song Liu wrote:
> >> Could you please share your feedback on this?
> >
> > I've looked at it all of 5 minutes, so perhaps I've missed something.
> >
> > However, I'm a little surprised you went with a second tree instead of
> > doing the top-down thing for data. The way you did it makes it hard to
> > have guard pages between text and data.
>
> I didn't realize the importance of the guard pages. But it is not too

I'm not sure how important it is, just seems like a good idea to trap
anybody trying to cross that divide. Also, to me it seems like a good
idea to have a single large contiguous text region instead of splintered
2M pages.

> hard to do it with this approach. For each 2MB text page, we can reserve
> 4kB on the beginning and end of it. Would this work?

Typically a guard page has different protections (as in none what so
ever) so that every access goes *splat*.

2022-08-23 06:57:59

by Christophe Leroy

[permalink] [raw]
Subject: Re: [RFC 0/5] vmalloc_exec for modules and BPF programs



Le 23/08/2022 à 07:42, Peter Zijlstra a écrit :
> On Mon, Aug 22, 2022 at 04:56:47PM +0000, Song Liu wrote:
>>
>>
>>> On Aug 22, 2022, at 9:34 AM, Peter Zijlstra <[email protected]> wrote:
>>>
>>> On Mon, Aug 22, 2022 at 03:46:38PM +0000, Song Liu wrote:
>>>> Could you please share your feedback on this?
>>>
>>> I've looked at it all of 5 minutes, so perhaps I've missed something.
>>>
>>> However, I'm a little surprised you went with a second tree instead of
>>> doing the top-down thing for data. The way you did it makes it hard to
>>> have guard pages between text and data.
>>
>> I didn't realize the importance of the guard pages. But it is not too
>
> I'm not sure how important it is, just seems like a good idea to trap
> anybody trying to cross that divide. Also, to me it seems like a good
> idea to have a single large contiguous text region instead of splintered
> 2M pages.
>
>> hard to do it with this approach. For each 2MB text page, we can reserve
>> 4kB on the beginning and end of it. Would this work?
>
> Typically a guard page has different protections (as in none what so
> ever) so that every access goes *splat*. >

Text is RO-X, on some architectures even only X. So the only real thing
to protect against is bad execution, isn't it ?. So I guess having some
areas with invalid or trap instructions would be enough ?

2022-08-23 07:15:43

by Song Liu

[permalink] [raw]
Subject: Re: [RFC 0/5] vmalloc_exec for modules and BPF programs



> On Aug 22, 2022, at 10:42 PM, Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Aug 22, 2022 at 04:56:47PM +0000, Song Liu wrote:
>>
>>
>>> On Aug 22, 2022, at 9:34 AM, Peter Zijlstra <[email protected]> wrote:
>>>
>>> On Mon, Aug 22, 2022 at 03:46:38PM +0000, Song Liu wrote:
>>>> Could you please share your feedback on this?
>>>
>>> I've looked at it all of 5 minutes, so perhaps I've missed something.
>>>
>>> However, I'm a little surprised you went with a second tree instead of
>>> doing the top-down thing for data. The way you did it makes it hard to
>>> have guard pages between text and data.
>>
>> I didn't realize the importance of the guard pages. But it is not too
>
> I'm not sure how important it is, just seems like a good idea to trap
> anybody trying to cross that divide. Also, to me it seems like a good
> idea to have a single large contiguous text region instead of splintered
> 2M pages.

A single large contiguous text region is great. However, it is not easy to
keep it contiguous. For example, when we load a big module, and then unload
it. It is not easy to recycle the space. Say we load module-x-v1, which is
4MB, and uses 2 huge pages. Then we load a small BPF program after it. The
address space looks like:

MODULE_VADDR to MODULE_VADDR + 4MB: module-x-v1
MODULE_VADDR + 4MB to MODULE_VADDR + 4MB + 4kB: bpf_prog_xxxx

When we unload module-x-v1, there will be 4MB hole in the address space.
If we then load module-x-v2, which is 4.1MB in size, we cannot reuse that
hole, because the module is a little too big for the hole.

AFAICT, to use the space efficiently, we will have to deal with splintered
2MB pages.

Does this make sense?

Thanks,
Song

>
>> hard to do it with this approach. For each 2MB text page, we can reserve
>> 4kB on the beginning and end of it. Would this work?
>
> Typically a guard page has different protections (as in none what so
> ever) so that every access goes *splat*.

2022-08-23 07:36:36

by Song Liu

[permalink] [raw]
Subject: Re: [RFC 0/5] vmalloc_exec for modules and BPF programs



> On Aug 22, 2022, at 11:39 PM, Christophe Leroy <[email protected]> wrote:
>
>
>
> Le 23/08/2022 à 07:42, Peter Zijlstra a écrit :
>> On Mon, Aug 22, 2022 at 04:56:47PM +0000, Song Liu wrote:
>>>
>>>
>>>> On Aug 22, 2022, at 9:34 AM, Peter Zijlstra <[email protected]> wrote:
>>>>
>>>> On Mon, Aug 22, 2022 at 03:46:38PM +0000, Song Liu wrote:
>>>>> Could you please share your feedback on this?
>>>>
>>>> I've looked at it all of 5 minutes, so perhaps I've missed something.
>>>>
>>>> However, I'm a little surprised you went with a second tree instead of
>>>> doing the top-down thing for data. The way you did it makes it hard to
>>>> have guard pages between text and data.
>>>
>>> I didn't realize the importance of the guard pages. But it is not too
>>
>> I'm not sure how important it is, just seems like a good idea to trap
>> anybody trying to cross that divide. Also, to me it seems like a good
>> idea to have a single large contiguous text region instead of splintered
>> 2M pages.
>>
>>> hard to do it with this approach. For each 2MB text page, we can reserve
>>> 4kB on the beginning and end of it. Would this work?
>>
>> Typically a guard page has different protections (as in none what so
>> ever) so that every access goes *splat*. >
>
> Text is RO-X, on some architectures even only X. So the only real thing
> to protect against is bad execution, isn't it ?. So I guess having some
> areas with invalid or trap instructions would be enough ?

Agreed that filling with trap instructions should be enough.

Thanks,
Song

2022-08-24 17:13:52

by Song Liu

[permalink] [raw]
Subject: Re: [RFC 0/5] vmalloc_exec for modules and BPF programs

Hi Peter,

> On Aug 22, 2022, at 9:56 AM, Song Liu <[email protected]> wrote:
>
>> On Aug 22, 2022, at 9:34 AM, Peter Zijlstra <[email protected]> wrote:
>>
>> On Mon, Aug 22, 2022 at 03:46:38PM +0000, Song Liu wrote:
>>> Could you please share your feedback on this?
>>
>> I've looked at it all of 5 minutes, so perhaps I've missed something.
>>
>> However, I'm a little surprised you went with a second tree instead of
>> doing the top-down thing for data. The way you did it makes it hard to
>> have guard pages between text and data.
>
> I didn't realize the importance of the guard pages. But it is not too
> hard to do it with this approach. For each 2MB text page, we can reserve
> 4kB on the beginning and end of it. Would this work?
>
> There are a couple benefits from a second tree:
>
> 1. It allows text allocations to go below PAGE_SIZE granularity, while
> data allocations would still use PAGE_SIZE granularity, which is the
> same as current code.
> 2. Text allocate requires mapping one vm_struct to many vmap_area. Putting
> text allocations in a separate tree make it easier to handle this.
> (Well, I haven't finished this logic yet).
> 3. A separate tree makes it easier to use text tail page,
> [_etext, roundup(_etext, PMD_SIZE)], for modules and BPF programs.
>
> Does this make sense? Do you see other downsides with a second tree?

Did these make sense? Do you have future comments that I would address in
future versions?

Thanks,
Song