2022-04-19 20:09:52

by Linus Torvalds

Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

On Tue, Apr 19, 2022 at 11:42 AM Mike Rapoport <[email protected]> wrote:
>
> I'd say that bpf_prog_pack was a cure for symptoms and this project tries
> to address a more general problem.
> But you are right, it'll take some time and won't land in 5.19.

Just to update people: I've just applied Song's [1/4] patch, which
means that the whole current hugepage vmalloc thing is effectively
disabled (because nothing opts in).
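
For reference, the opt-in after that patch boils down to a caller passing
a new vm flag; a minimal sketch (the flag comes from Song's series, but
this call site is hypothetical since nothing in-tree passes it yet):

        /* Hypothetical caller opting in to huge-page backing; without
         * VM_ALLOW_HUGE_VMAP the allocation stays PAGE_SIZE-mapped. */
        void *buf = __vmalloc_node_range(size, PMD_SIZE, VMALLOC_START,
                                         VMALLOC_END, GFP_KERNEL,
                                         PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
                                         NUMA_NO_NODE,
                                         __builtin_return_address(0));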

And I suspect that will be the status for 5.18, unless somebody comes
up with some very strong arguments for (re-)starting using huge pages.

Linus


2022-04-21 12:54:19

by Nicholas Piggin

Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

Excerpts from Linus Torvalds's message of April 20, 2022 5:20 am:
> On Tue, Apr 19, 2022 at 11:42 AM Mike Rapoport <[email protected]> wrote:
>>
>> I'd say that bpf_prog_pack was a cure for symptoms and this project tries
>> to address a more general problem.
>> But you are right, it'll take some time and won't land in 5.19.
>
> Just to update people: I've just applied Song's [1/4] patch, which
> means that the whole current hugepage vmalloc thing is effectively
> disabled (because nothing opts in).
>
> And I suspect that will be the status for 5.18, unless somebody comes
> up with some very strong arguments for (re-)starting using huge pages.

Why not just revert fac54e2bfb5b?

Thanks,
Nick

2022-04-21 16:27:54

by Alexei Starovoitov

Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

On Tue, Apr 19, 2022 at 12:20:39PM -0700, Linus Torvalds wrote:
> On Tue, Apr 19, 2022 at 11:42 AM Mike Rapoport <[email protected]> wrote:
> >
> > I'd say that bpf_prog_pack was a cure for symptoms and this project tries
> > to address a more general problem.
> > But you are right, it'll take some time and won't land in 5.19.
>
> Just to update people: I've just applied Song's [1/4] patch, which
> means that the whole current hugepage vmalloc thing is effectively
> disabled (because nothing opts in).
>
> And I suspect that will be the status for 5.18, unless somebody comes
> up with some very strong arguments for (re-)starting using huge pages.

Here is the quote from Song's cover letter for bpf_prog_pack series:

Most BPF programs are small, but they consume a page each. For systems
with busy traffic and many BPF programs, this could also add significant
pressure to the instruction TLB. High iTLB pressure usually causes a
slowdown for the whole system, which includes visible performance degradation for
production workloads.

The last sentence is the key. We've added this feature not because of the
bpf programs themselves, so calling this feature an optimization is not
quite correct. The number of bpf programs on a production server doesn't
matter; the programs come and go all the time. That is the key here. The 4k
module_alloc() plus the set_memory_ro/x done by the JIT break down huge
pages and increase TLB pressure on the kernel code. That creates visible
performance degradation for normal user space workloads that are not doing
anything bpf related. mm folks can fill in the details here; my
understanding is that it's something to do with the identity mapping (the
direct map).
So we're not trying to improve bpf performance. We're trying to make
sure that bpf program load/unload doesn't affect the speed of the kernel.
Generalizing bpf_prog_pack to modules would be nice, but it's not clear
what benefits such an optimization might have. It's orthogonal here.
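
To make the mechanism concrete, here is a rough sketch of the
pre-bpf_prog_pack JIT allocation path that causes those splits
(simplified; the real flow goes through bpf_jit_binary_alloc() and the
arch JIT):

        /* Each JITed program gets its own fresh vmalloc page. */
        void *img = module_alloc(size);

        /* Flipping permissions on that page also updates its alias in
         * the kernel direct map, splitting a 2M/1G mapping into 4K
         * PTEs - the TLB pressure described above. */
        set_memory_ro((unsigned long)img, 1);
        set_memory_x((unsigned long)img, 1);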

So I argue that all 4 of Song's fixes are necessary in 5.18.
We need an additional zeroing patch too, of course, to make sure the huge
page doesn't have garbage at alloc time and is cleaned after the prog is
unloaded.

Regarding JIT spraying and other concerns, the short answer is: nothing changed.
JIT spraying was mitigated with start address randomization and invalid
instruction padding. Both features are still present.
Constant blinding is also fully functional.
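
For reference, the padding mentioned above is the arch JIT filling unused
space with trapping instructions. A sketch modeled on the x86 helper:

        static void jit_fill_hole(void *area, unsigned int size)
        {
                /* Unused bytes become int3, so a stray jump into the
                 * region traps instead of executing sprayed bytes. */
                memset(area, 0xcc, size);
        }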

Any kind of generalization of bpf_prog_pack into a general mm feature
would be nice, but it cannot be done as an opportunistic cache. We need a
guarantee that bpf prog load/unload won't recreate the issue with kernel
performance degradation. I suspect we will need bpf_prog_pack in its
current form for the foreseeable future.

2022-04-21 22:32:47

by Linus Torvalds

Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

On Wed, Apr 20, 2022 at 8:25 PM Nicholas Piggin <[email protected]> wrote:
>
> Why not just revert fac54e2bfb5b?

That would be stupid, with no sane way forward.

The fact is, HUGE_VMALLOC was badly misdesigned, and enabling it on
x86 only ended up showing the problems.

It wasn't fac54e2bfb5b that was the fundamental issue. It was the
whole "oh, we should never have done it that way to begin with".

The whole initial notion that HAVE_ARCH_HUGE_VMALLOC means that there
must be no PAGE_SIZE pte assumptions was simply broken. There were
actual real cases that had those assumptions, and the whole "let's
just change vmalloc behavior by default and then people who don't like
it can opt out" was just fundamentally a really bad idea.

Power had that random "oh, we don't want to do this for module_alloc",
for which you had a comment about "more testing".

And s390 had a case of hardware limitations where it didn't work for some cases.

And then enabling it on x86 turned up more issues.

So yes, commit fac54e2bfb5b _exposed_ things to a much larger
audience. But all it made clear was that your original notion of
"let's change behavior and randomly disable it as things turn up" was
just broken.

Including "small" details like the fact that apparently
VM_FLUSH_RESET_PERMS didn't work correctly any more for this, which
caused issues for bpf, hence that [PATCH 4/4]. And yes, there was a
half-arsed comment ("may require extra work") to that effect in the
powerpc __module_alloc() function, but it had been left to others to
notice separately.
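
For context, executable allocations normally rely on that flag via a
pattern like the following sketch (helpers as in <linux/vmalloc.h> and
<asm/set_memory.h>; npages here stands for whatever the allocation
spans):

        void *buf = module_alloc(size);

        /* Ask vfree() to reset the direct-map permissions on free. */
        set_vm_flush_reset_perms(buf);
        set_memory_ro((unsigned long)buf, npages);
        set_memory_x((unsigned long)buf, npages);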

So no. We're not going back to that completely broken model. The
largepage thing needs to be opt-in, and needs a lot more care.

Linus

2022-04-22 18:34:47

by Linus Torvalds

Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

On Wed, Apr 20, 2022 at 10:48 PM Linus Torvalds
<[email protected]> wrote:
>
> The largepage thing needs to be opt-in, and needs a lot more care.

Side note: part of the opt-in really should be about the performance impact.

It clearly can be quite noticeable, as outlined by that powerpc case
in commit 8abddd968a30 ("powerpc/64s/radix: Enable huge vmalloc
mappings"), but it presumably is some _particular_ case that actually
matters.

But it's equally clearly not the module code/data case, since
__module_alloc() explicitly disables largepages on powerpc.

At a guess, it's one or more of the large hash-table allocations.

And it would actually be interesting to hear *which one*. From the
'git diff' workload, I'd expect it to be the dentry lookup hash table
- I can't think of anything else that would be vmalloc'ed that would
be remotely interesting - but who knows.
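
For reference, that table comes from alloc_large_system_hash(); the call
below is as in fs/dcache.c, and would be the obvious place to hang a
hugepage opt-in:

        dentry_hashtable =
                alloc_large_system_hash("Dentry cache",
                                        sizeof(struct hlist_bl_head),
                                        dhash_entries,
                                        13,
                                        HASH_ZERO,
                                        &d_hash_shift,
                                        NULL,
                                        0,
                                        0);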

So I think the whole "opt in" isn't _purely_ about the "oh, random
cases are broken for odd reasons, so let's not enable it by default".

I think it would actually be good to literally mark the cases that
matter (and have the performance numbers for those cases).

Linus

2022-04-22 19:16:57

by Nicholas Piggin

Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

Excerpts from Linus Torvalds's message of April 21, 2022 3:48 pm:
> On Wed, Apr 20, 2022 at 8:25 PM Nicholas Piggin <[email protected]> wrote:
>>
>> Why not just revert fac54e2bfb5b?
>
> That would be stupid, with no sane way forward.

Oh I missed this comment. Now I'm completely confused. Reverting the
patch which caused the breakage *is* the sane way forward. Your tree
is not broken after that so you're done.

And if x86 wants to select HUGE_VMALLOC in the future, it can do so
after fixing the issues it has with it, so that's the sane way
forward for that.

What you have now is what's insane. HAVE_ARCH_HUGE_VMALLOC now means
"you can ask for huge pages but it might crash in undocumented
arch-specific circumstances so good luck".

Thanks,
Nick

2022-04-22 19:23:48

by Nicholas Piggin

Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

Excerpts from Linus Torvalds's message of April 21, 2022 4:02 pm:
> On Wed, Apr 20, 2022 at 10:48 PM Linus Torvalds
> <[email protected]> wrote:
>>
>> The largepage thing needs to be opt-in, and needs a lot more care.
>
> Side note: part of the opt-in really should be about the performance impact.
>
> It clearly can be quite noticeable, as outlined by that powerpc case
> in commit 8abddd968a30 ("powerpc/64s/radix: Enable huge vmalloc
> mappings"), but it presumably is some _particular_ case that actually
> matters.
>
> But it's equally clearly not the module code/data case, since
> __module_alloc() explicitly disables largepages on powerpc.
>
> At a guess, it's one or more of the large hash-table allocations.

The changelog is explicit that it is the vfs hashes.

> And it would actually be interesting to hear *which one*. From the
> 'git diff' workload, I'd expect it to be the dentry lookup hash table
> - I can't think of anything else that would be vmalloc'ed that would
> be remotely interesting - but who knows.

I didn't measure dentry/inode separately but it should mostly
(~entirely?) be the dentry hash, yes.

> So I think the whole "opt in" isn't _purely_ about the "oh, random
> cases are broken for odd reasons, so let's not enable it by default".

The whole concept is totally broken upstream now though. Core code
absolutely cannot mark any allocation as able to use huge pages
because x86 is in some crazy half-working state. Can we use a hugepage
dentry cache on x86 with hibernation? With BPF? Who knows.

> I think it would actually be good to literally mark the cases that
> matter (and have the performance numbers for those cases).

As per my previous comment, not for correctness but possibly to help
guide some heuristic. I don't see it being too big a deal though:
a multi-MB vmalloc that can use hugepages probably wants to, and the
downside is quite small (fragmentation being about the only one, and
there aren't a vast number of such allocations in the kernel for it
to have been noticed yet).

Thanks,
Nick

2022-04-22 21:23:47

by Nicholas Piggin

Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

Excerpts from Linus Torvalds's message of April 21, 2022 3:48 pm:
> On Wed, Apr 20, 2022 at 8:25 PM Nicholas Piggin <[email protected]> wrote:
>>
>> Why not just revert fac54e2bfb5b?
>
> That would be stupid, with no sane way forward.
>
> The fact is, HUGE_VMALLOC was badly misdesigned, and enabling it on
> x86 only ended up showing the problems.
>
> It wasn't fac54e2bfb5b that was the fundamental issue. It was the
> whole "oh, we should never have done it that way to begin with".
>
> The whole initial notion that HAVE_ARCH_HUGE_VMALLOC means that there
> must be no PAGE_SIZE pte assumptions was simply broken.

It didn't have that requirement so much as required it to be
accounted for if the arch enabled it.

> There were
> actual real cases that had those assumptions, and the whole "let's
> just change vmalloc behavior by default and then people who don't like
> it can opt out" was just fundamentally a really bad idea.
>
> Power had that random "oh, we don't want to do this for module_alloc",
> for which you had a comment about "more testing".
>
> And s390 had a case of hardware limitations where it didn't work for some cases.
>
> And then enabling it on x86 turned up more issues.

Those were (AFAIKS) all in arch code though. The patch was the
fundamental issue for x86 because it had bugs. I don't quite see
what your objection is to power's and s390's working implementations.
Some parts of the arch code could not cope with huge PTEs, so they
used small pages.

Switching the API around to expect non-arch code to know whether or
not it can use huge mappings is much worse. How is
alloc_large_system_hash expected to know whether it may use huge
pages on any given half-broken arch like x86?

It's the same as huge iomap, which we have had for a long time. No driver
should be expected to have to understand that.
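
Compare the iomap case: a driver writes something like the sketch below,
and whether the mapping uses huge pages is decided entirely inside
arch/mm code (phys_base here is illustrative):

        /* The caller never says "huge"; that's an arch/mm decision. */
        void __iomem *regs = ioremap(phys_base, SZ_2M);
        if (!regs)
                return -ENOMEM;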

> So yes, commit fac54e2bfb5b _exposed_ things to a much larger
> audience. But all it made clear was that your original notion of
> "let's change behavior and randomly disable it as things turn up" was
> just broken.
>
> Including "small" details like the fact that apparently
> VM_FLUSH_RESET_PERMS didn't work correctly any more for this, which
> caused issues for bpf, hence that [PATCH 4/4].

Which is another arch detail.

> And yes, there was a
> half-arsed comment ("may require extra work") to that effect in the
> powerpc __module_alloc() function, but it had been left to others to
> notice separately.

It had a comment in arch/Kconfig about it. Combing through the
details of every arch is left to others who choose to opt in, though.

> So no. We're not going back to that completely broken model. The
> largepage thing needs to be opt-in, and needs a lot more care.

I don't think it should be opt-in at the caller level (at least not
outside arch/). As I said earlier, maybe we end up finding fragmentation
to be a problem that can't be solved with simple heuristic tweaking,
and then we could think about adding something to give a size/speed
hint tradeoff; but as for "can this caller use huge-vmap-backed memory",
it should not be something drivers or core code have to think about.

Thanks,
Nick