2023-10-10 16:43:16

by Uros Bizjak

[permalink] [raw]
Subject: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

Implementing arch_raw_cpu_ptr() in C allows the compiler to perform
better optimizations, such as using the per-CPU offset directly as the
base register when computing the address, instead of emitting a
separate add instruction.

E.g. the address calculation in amd_pmu_enable_virt() improves from:

48 c7 c0 00 00 00 00 mov $0x0,%rax
87b7: R_X86_64_32S cpu_hw_events

65 48 03 05 00 00 00 add %gs:0x0(%rip),%rax
00
87bf: R_X86_64_PC32 this_cpu_off-0x4

48 c7 80 28 13 00 00 movq $0x0,0x1328(%rax)
00 00 00 00

to:

65 48 8b 05 00 00 00 mov %gs:0x0(%rip),%rax
00
8798: R_X86_64_PC32 this_cpu_off-0x4
48 c7 80 00 00 00 00 movq $0x0,0x0(%rax)
00 00 00 00
87a6: R_X86_64_32S cpu_hw_events+0x1328
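
(For context, the code being compiled is essentially the following per-CPU
pointer access; a rough sketch of the relevant lines, not the exact kernel
source. The 0x1328 displacement above is the offset of perf_ctr_virt_mask
inside struct cpu_hw_events:)

void amd_pmu_enable_virt(void)
{
        struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);

        cpuc->perf_ctr_virt_mask = 0;
}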

Co-developed-by: Nadav Amit <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
Signed-off-by: Uros Bizjak <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
---
arch/x86/include/asm/percpu.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 60ea7755c0fe..cdc188279c5a 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -49,6 +49,19 @@
 #define __force_percpu_prefix	"%%"__stringify(__percpu_seg)":"
 #define __my_cpu_offset		this_cpu_read(this_cpu_off)
 
+#ifdef CONFIG_USE_X86_SEG_SUPPORT
+/*
+ * Efficient implementation for cases in which the compiler supports
+ * named address spaces. Allows the compiler to perform additional
+ * optimizations that can save more instructions.
+ */
+#define arch_raw_cpu_ptr(ptr)						\
+({									\
+	unsigned long tcp_ptr__;					\
+	tcp_ptr__ = __raw_cpu_read(, this_cpu_off) + (unsigned long)(ptr); \
+	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;			\
+})
+#else /* CONFIG_USE_X86_SEG_SUPPORT */
 /*
  * Compared to the generic __my_cpu_offset version, the following
  * saves one instruction and avoids clobbering a temp register.
@@ -61,6 +74,8 @@
 	    : "m" (__my_cpu_var(this_cpu_off)), "0" (ptr));		\
 	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;			\
 })
+#endif /* CONFIG_USE_X86_SEG_SUPPORT */
+
 #else /* CONFIG_SMP */
 #define __percpu_seg_override
 #define __percpu_prefix		""
--
2.41.0


2023-10-10 17:33:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, 10 Oct 2023 at 09:43, Uros Bizjak <[email protected]> wrote:
>
> Implementing arch_raw_cpu_ptr() in C, allows the compiler to perform
> better optimizations, such as setting an appropriate base to compute
> the address instead of an add instruction.

Hmm. I wonder..

> + tcp_ptr__ = __raw_cpu_read(, this_cpu_off) + (unsigned long)(ptr); \

Do we really even want to use __raw_cpu_read(this_cpu_off) at all?

On my machines (I tested an Intel 8th gen laptop, and my AMD Zen 2
Threadripper machine), 'rdgsbase' seems to be basically two cycles.

I wonder if we'd be better off using that, rather than doing the load.

Yes, a load that hits in L1D$ will schedule better, so it's "cheaper"
in that sense. The rdgsbase instruction *probably* will end up only
decoding in the first decoder etc. But we're talking single-cycle kind
of effects, and the rdgsbase case should be much better from a cache
perspective and might use fewer memory pipeline resources to offset
the fact that it uses an unusual front end decoder resource...

Yes, yes, we'd have to make it an alternative, and do something like

static __always_inline unsigned long new_cpu_offset(void)
{
        unsigned long res;
        asm(ALTERNATIVE(
                "movq %%gs:this_cpu_off,%0",
                "rdgsbase %0",
                X86_FEATURE_FSGSBASE)
                : "=r" (res));
        return res;
}

but it still seems fairly straightforward. Hmm?

Linus

2023-10-10 18:23:14

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, Oct 10, 2023 at 7:32 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, 10 Oct 2023 at 09:43, Uros Bizjak <[email protected]> wrote:
> >
> > Implementing arch_raw_cpu_ptr() in C, allows the compiler to perform
> > better optimizations, such as setting an appropriate base to compute
> > the address instead of an add instruction.
>
> Hmm. I wonder..
>
> > + tcp_ptr__ = __raw_cpu_read(, this_cpu_off) + (unsigned long)(ptr); \
>
> Do we really even want to use __raw_cpu_read(this_cpu_off) at all?

Please note that besides propagation of the addition into address, the
patch also exposes memory load to the compiler, with the anticipation
that the compiler CSEs the load from this_cpu_off from eventual
multiple addresses. For this to work, we have to get rid of the asms.
It is important that the compiler knows that this is a memory load, so
it can also apply other compiler magic to it.

BTW: A follow-up patch will also use __raw_cpu_read to implement
this_cpu_read_stable. We can then read a "const aliased" current_task to
CSE the load even more, something similar to [1].

[1] https://lore.kernel.org/lkml/[email protected]/

Uros.

> On my machines (I tested an Intel 8th gen laptop, and my AMD Zen 2
> Threadripper machine), 'rdgsbase' seems to be basically two cycles.
>
> I wonder if we'd be better off using that, rather than doing the load.
>
> Yes, a load that hits in L1D$ will schedule better, so it's "cheaper"
> in that sense. The rdgsbase instruction *probably* will end up only
> decoding in the first decoder etc. But we're talking single-cycle kind
> of effects, and the rdgsbase case should be much better from a cache
> perspective and might use fewer memory pipeline resources to offset
> the fact that it uses an unusual front end decoder resource...
>
> Yes, yes, we'd have to make it an alternative, and do something like
>
> static __always_inline unsigned long new_cpu_offset(void)
> {
> unsigned long res;
> asm(ALTERNATIVE(
> "movq %%gs:this_cpu_off,%0",
> "rdgsbase %0",
> X86_FEATURE_FSGSBASE)
> : "=r" (res));
> return res;
> }
>
> but it still seems fairly straightforward. Hmm?
>
> Linus

2023-10-10 18:26:27

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()



> On Oct 10, 2023, at 9:22 PM, Uros Bizjak <[email protected]> wrote:
>
> On Tue, Oct 10, 2023 at 7:32 PM Linus Torvalds
> <[email protected]> wrote:
>>
>> On Tue, 10 Oct 2023 at 09:43, Uros Bizjak <[email protected]> wrote:
>>>
>>> Implementing arch_raw_cpu_ptr() in C, allows the compiler to perform
>>> better optimizations, such as setting an appropriate base to compute
>>> the address instead of an add instruction.
>>
>> Hmm. I wonder..
>>
>>> + tcp_ptr__ = __raw_cpu_read(, this_cpu_off) + (unsigned long)(ptr); \
>>
>> Do we really even want to use __raw_cpu_read(this_cpu_off) at all?
>
> Please note that besides propagation of the addition into address, the
> patch also exposes memory load to the compiler, with the anticipation
> that the compiler CSEs the load from this_cpu_off from eventual
> multiple addresses. For this to work, we have to get rid of the asms.
> It is important that the compiler knows that this is a memory load, so
> it can also apply other compiler magic to it.
>
> BTW: A follow-up patch will also use__raw_cpu_read to implement
> this_cpu_read_stable. We can then read "const aliased" current_task to
> CSE the load even more, something similar to [1].

I was just writing the same thing. :)

As a minor note the proposed assembly version seems to be missing
__FORCE_ORDER as an input argument to prevent reordering past preempt_enable
and preempt_disable. But that’s really not the main point.
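
(For illustration, a minimal sketch of what adding that dependency could
look like, assuming the kernel's __FORCE_ORDER dummy memory operand from
<asm/special_insns.h>; this is illustrative, not the posted patch:)

static __always_inline unsigned long new_cpu_offset(void)
{
        unsigned long res;

        /* Sketch: the __FORCE_ORDER input gives the asm a (fake) memory
         * dependency, so the compiler cannot move it across the barrier()
         * in preempt_disable()/preempt_enable(). */
        asm(ALTERNATIVE("movq %%gs:this_cpu_off,%0",
                        "rdgsbase %0",
                        X86_FEATURE_FSGSBASE)
            : "=r" (res)
            : __FORCE_ORDER);
        return res;
}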

2023-10-10 18:38:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, 10 Oct 2023 at 11:22, Uros Bizjak <[email protected]> wrote:
>
> Please note that besides propagation of the addition into address, the
> patch also exposes memory load to the compiler, with the anticipation
> that the compiler CSEs the load from this_cpu_off from eventual
> multiple addresses. For this to work, we have to get rid of the asms.

I actually checked that the inline asm gets combined, the same way the
this_cpu_read_stable cases do (which we use for 'current').

Linus

2023-10-10 18:41:51

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, Oct 10, 2023 at 8:38 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, 10 Oct 2023 at 11:22, Uros Bizjak <[email protected]> wrote:
> >
> > Please note that besides propagation of the addition into address, the
> > patch also exposes memory load to the compiler, with the anticipation
> > that the compiler CSEs the load from this_cpu_off from eventual
> > multiple addresses. For this to work, we have to get rid of the asms.
>
> I actually checked that the inline asm gets combined, the same way the
> this_cpu_read_stable cases do (which we use for 'current').

Yes, but does it CSE the load from multiple addresses?

Uros.

2023-10-10 18:43:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, 10 Oct 2023 at 11:25, Nadav Amit <[email protected]> wrote:
>
> As a minor note the proposed assembly version seems to be missing
> __FORCE_ORDER as an input argument to prevent reordering past preempt_enable
> and preempt_disable. But that’s really not the main point.

Hmm. No, it probably *is* the main point - see my reply to Uros that
the CSE on the inline asm itself gets rid of duplication.

And yes, we currently rely on that asm CSE for doing 'current' and not
reloading the value all the time.

So yes, we'd like to have a barrier for not moving it across the
preemption barriers, and __FORCE_ORDER would seem to be a good way to
do that.

I really suspect that 'rdgsbase' is better than a memory load in
practice, but I have no numbers to back that up, apart from a "it's
not a slow instruction, and core CPU is generally better than memory".

Linus

2023-10-10 18:52:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, 10 Oct 2023 at 11:41, Uros Bizjak <[email protected]> wrote:
>
> Yes, but does it CSE the load from multiple addresses?

Yes, it should do that just right, because the *asm* itself is
identical, just the offsets (that gcc then adds separately) would be
different.

This is not unlike how we depend on gcc CSE'ing the "current" part
when doing multiple accesses of different members off that:

static __always_inline struct task_struct *get_current(void)
{
return this_cpu_read_stable(pcpu_hot.current_task);
}

with this_cpu_read_stable() being an inline asm that lacks the memory
component (the same way the fallback hides it by just using
"%%gs:this_cpu_off" directly inside the asm, instead of exposing it as
a memory access to gcc).

Of course, I think that with the "__seg_gs" patches, we *could* expose
the "%%gs:this_cpu_off" part to gcc, since gcc hopefully then can do
the alias analysis on that side and see that it can CSE the thing
anyway.

That might be a better choice than __FORCE_ORDER, in fact.

IOW, something like

static __always_inline unsigned long new_cpu_offset(void)
{
        unsigned long res;
        asm(ALTERNATIVE(
                "movq " __percpu_arg(1) ",%0",
                "rdgsbase %0",
                X86_FEATURE_FSGSBASE)
                : "=r" (res)
                : "m" (this_cpu_off));
        return res;
}

would presumably work together with your __seg_gs stuff.

UNTESTED!!

Linus

2023-10-11 07:28:32

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, Oct 10, 2023 at 8:52 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, 10 Oct 2023 at 11:41, Uros Bizjak <[email protected]> wrote:
> >
> > Yes, but does it CSE the load from multiple addresses?
>
> Yes, it should do that just right, because the *asm* itself is
> identical, just the offsets (that gcc then adds separately) would be
> different.

Indeed. To illustrate the question with an example, foo() and bar()
should compile to the same assembly, and there should be only one read
from m and n, respectively:

--cut here--
__seg_gs int m;

int foo (void)
{
  return m + m;
}

int n;

static inline int get (int *m)
{
  int res;

  asm ("mov %%gs:%1, %0" : "=r"(res) : "m"(*m));
  return res;
}

int bar (void)
{
  return get (&n) + get (&n);
}
--cut here--

And they do:

0000000000000000 <foo>:
0: 65 8b 05 00 00 00 00 mov %gs:0x0(%rip),%eax # 7 <foo+0x7>
7: 01 c0 add %eax,%eax
9: c3 retq

0000000000000010 <bar>:
10: 65 8b 05 00 00 00 00 mov %gs:0x0(%rip),%eax # 17 <bar+0x7>
17: 01 c0 add %eax,%eax
19: c3 retq

>
> This is not unlike how we depend on gcc CSE'ing the "current" part
> when doing multiple accesses of different members off that:
>
> static __always_inline struct task_struct *get_current(void)
> {
> return this_cpu_read_stable(pcpu_hot.current_task);
> }
>
> with this_cpu_read_stable() being an inline asm that lacks the memory
> component (the same way the fallback hides it by just using
> "%%gs:this_cpu_off" directly inside the asm, instead of exposing it as
> a memory access to gcc).
>
> Of course, I think that with the "__seg_gs" patches, we *could* expose
> the "%%gs:this_cpu_off" part to gcc, since gcc hopefully then can do
> the alias analysis on that side and see that it can CSE the thing
> anyway.
>
> That might be a better choice than __FORCE_ORDER, in fact.
>
> IOW, something like
>
> static __always_inline unsigned long new_cpu_offset(void)
> {
> unsigned long res;
> asm(ALTERNATIVE(
> "movq " __percpu_arg(1) ",%0",
> "rdgsbase %0",
> X86_FEATURE_FSGSBASE)
> : "=r" (res)
> : "m" (this_cpu_off));
> return res;
> }
>
> would presumably work together with your __seg_gs stuff.

I have zero experience with the rdgsbase insn, but the above is not
dependent on __seg_gs, so (the movq part at least) it would also work in
the current mainline. To work together with the __seg_gs stuff,
this_cpu_off should be enclosed in __my_cpu_var, roughly as sketched
below. Also, if rdgsbase is substituted with rdfsbase, it will also work
for 32-bit targets.
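
(A sketch of that adjustment, reusing the __my_cpu_var() helper that the
fallback arch_raw_cpu_ptr() above already uses; illustrative only, not a
posted patch:)

static __always_inline unsigned long new_cpu_offset(void)
{
        unsigned long res;

        /* Same ALTERNATIVE, but the memory operand goes through
         * __my_cpu_var() so it has the right address space on
         * __seg_gs builds. */
        asm(ALTERNATIVE("movq " __percpu_arg(1) ",%0",
                        "rdgsbase %0",
                        X86_FEATURE_FSGSBASE)
            : "=r" (res)
            : "m" (__my_cpu_var(this_cpu_off)));
        return res;
}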

Uros.

> UNTESTED!!
>
> Linus

2023-10-11 07:42:21

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()



> On Oct 10, 2023, at 9:37 PM, Linus Torvalds <[email protected]> wrote:
>
>
> On Tue, 10 Oct 2023 at 11:22, Uros Bizjak <[email protected]> wrote:
>>
>> Please note that besides propagation of the addition into address, the
>> patch also exposes memory load to the compiler, with the anticipation
>> that the compiler CSEs the load from this_cpu_off from eventual
>> multiple addresses. For this to work, we have to get rid of the asms.
>
> I actually checked that the inline asm gets combined, the same way the
> this_cpu_read_stable cases do (which we use for 'current’)

You are correct. Having said that, for "current" we may be able to do
something better, as regardless of preemption "current" remains the same,
and this_cpu_read_stable() does miss some opportunities to avoid reloading
the value from memory. I proposed a solution before, but I am not sure it
would work properly with LTO. I guess Uros would know better.


2023-10-11 07:46:19

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 11, 2023 at 9:27 AM Uros Bizjak <[email protected]> wrote:

> > IOW, something like
> >
> > static __always_inline unsigned long new_cpu_offset(void)
> > {
> > unsigned long res;
> > asm(ALTERNATIVE(
> > "movq " __percpu_arg(1) ",%0",
> > "rdgsbase %0",
> > X86_FEATURE_FSGSBASE)
> > : "=r" (res)
> > : "m" (this_cpu_off));
> > return res;
> > }
> >
> > would presumably work together with your __seg_gs stuff.
>
> I have zero experience with rdgsbase insn, but the above is not
> dependent on __seg_gs, so (the movq part at least) would also work in
> the current mainline. To work together with __seg_gs stuff,
> this_cpu_offset should be enclosed in __my_cpu_var. Also, if rdgsbase
> is substituted with rdfsbase, it will also work for 32-bit targets.

In fact, rdgsbase is available only for 64-bit targets.

Uros.

2023-10-11 18:42:50

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, Oct 10, 2023 at 8:52 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, 10 Oct 2023 at 11:41, Uros Bizjak <[email protected]> wrote:
> >
> > Yes, but does it CSE the load from multiple addresses?
>
> Yes, it should do that just right, because the *asm* itself is
> identical, just the offsets (that gcc then adds separately) would be
> different.
>
> This is not unlike how we depend on gcc CSE'ing the "current" part
> when doing multiple accesses of different members off that:
>
> static __always_inline struct task_struct *get_current(void)
> {
> return this_cpu_read_stable(pcpu_hot.current_task);
> }
>
> with this_cpu_read_stable() being an inline asm that lacks the memory
> component (the same way the fallback hides it by just using
> "%%gs:this_cpu_off" directly inside the asm, instead of exposing it as
> a memory access to gcc).
>
> Of course, I think that with the "__seg_gs" patches, we *could* expose
> the "%%gs:this_cpu_off" part to gcc, since gcc hopefully then can do
> the alias analysis on that side and see that it can CSE the thing
> anyway.
>
> That might be a better choice than __FORCE_ORDER, in fact.
>
> IOW, something like
>
> static __always_inline unsigned long new_cpu_offset(void)
> {
> unsigned long res;
> asm(ALTERNATIVE(
> "movq " __percpu_arg(1) ",%0",
> "rdgsbase %0",
> X86_FEATURE_FSGSBASE)
> : "=r" (res)
> : "m" (this_cpu_off));
> return res;
> }
>
> would presumably work together with your __seg_gs stuff.
>
> UNTESTED!!

The attached patch was tested on a target with fsgsbase CPUID and
without it. It works!

The patch improves amd_pmu_enable_virt() in the same way as reported
in the original patch submission and also reduces the number of percpu
offset reads (either from this_cpu_off or with rdgsbase) from 1663 to
1571.

The only drawback is a larger binary size:

text data bss dec hex filename
25546594 4387686 808452 30742732 1d518cc vmlinux-new.o
25515256 4387814 808452 30711522 1d49ee2 vmlinux-old.o

that increases by 31k (0.123%), probably due to 1578 rdgsbase alternatives.

I'll prepare and submit a patch for tip/percpu branch.

Uros.


>
> Linus


Attachments:
cpu_ptr-mainline.diff.txt (1.46 kB)

2023-10-11 19:37:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 11 Oct 2023 at 00:42, Nadav Amit <[email protected]> wrote:
>
> You are correct. Having said that, for “current" we may be able to do something
> better, as regardless to preemption “current" remains the same, and
> this_cpu_read_stable() does miss some opportunities to avoid reloading the
> value from memory.

It would be lovely to generate even better code, but that
this_cpu_read_stable() thing is the best we've come up with. It
intentionally has *no* memory inputs or anything else that might make
gcc think "I need to re-do this".

For example, instead of using "m" as a memory input, it very
intentionally uses "p", to make it clear that that it just uses the
_pointer_, not the memory location itself.
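
(A simplified sketch of that pattern; the real macro is percpu_stable_op()
in arch/x86/include/asm/percpu.h, and the name below is made up:)

/* Sketch: only the *address* of the per-CPU variable is passed in, via
 * the "p" constraint, so gcc sees no memory operand that a store or
 * clobber could invalidate and is free to reuse the asm's result. */
#define this_cpu_read_stable_sketch(var)                                \
({                                                                      \
        typeof(var) val__;                                              \
        asm("movq %%gs:%P1, %0" : "=r" (val__) : "p" (&(var)));         \
        val__;                                                          \
})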

That's obviously a lie - it actually does access memory - but it's a
lie exactly because of the reason you mention: even when the memory
location changes due to preemption (or explicit scheduling), it always
changes back to the value we care about.

So gcc _should_ be able to CSE it in all situations, but it's entirely
possible that gcc then decides to re-generate the value for whatever
reason. It's a cheap op, so it's ok to regen, of course, but the
intent is basically to let the compiler re-use the value as much as
possible.

But it *is* probably better to regenerate the value than it would be
to spill and re-load it, and from the cases I've seen, this all tends
to work fairly well.

Linus

2023-10-11 19:41:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 11 Oct 2023 at 00:45, Uros Bizjak <[email protected]> wrote:
>
> In fact, rdgsbase is available only for 64-bit targets.

Not even all 64-bit targets. That's why I did that ALTERNATIVE() thing
with X86_FEATURE_FSGSBASE, which uses the kernel instruction
re-writing.

So that suggested asm of mine defaults to loading the value from
memory through %gs, but with X86_FEATURE_FSGSBASE it gets rewritten to
use rdgsbase.

And again - I'm not sure it's any faster, but it's _potentially_
certainly better in that it doesn't use the cache-line.

Linus

2023-10-11 19:52:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 11 Oct 2023 at 11:42, Uros Bizjak <[email protected]> wrote:
>
> The attached patch was tested on a target with fsgsbase CPUID and
> without it. It works!

.. I should clearly read all my emails before answering some of them.

Yes, that patch looks good to me, and I'm happy to hear that you
actually tested it unlike my "maybe something like this".

> The patch improves amd_pmu_enable_virt() in the same way as reported
> in the original patch submission and also reduces the number of percpu
> offset reads (either from this_cpu_off or with rdgsbase) from 1663 to
> 1571.

Dio y ou have any actka performance numbers? The patch looks good to
me, and I *think* rdgsbase ends up being faster in practice due to
avoiding a memory access, but that's very much a gut feel.

> The only drawback is a larger binary size:
>
> text data bss dec hex filename
> 25546594 4387686 808452 30742732 1d518cc vmlinux-new.o
> 25515256 4387814 808452 30711522 1d49ee2 vmlinux-old.o
>
> that increases by 31k (0.123%), probably due to 1578 rdgsbase alternatives.

I'm actually surprised that it increases the text size. The 'rdgsbase'
instruction should be smaller than a 'mov %gs', so I would have
expected the *data* size to increase due to the alternatives tables,
but not the text size.

[ Looks around ]

Oh. It's because we put the altinstructions into the text section.
That's kind of silly, but whatever.

So I think that increase in text-size is not "real" - yes, it
increases our binary size because we obviously have two instructions,
but the actual *executable* part likely stays the same, and it's just
that we grow the altinstruction metadata.

Linus

2023-10-11 19:53:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 11 Oct 2023 at 12:51, Linus Torvalds
<[email protected]> wrote:
>
> Dio y ou have any actka performance numbers?

And before anybody worries - no, I didn't have a stroke in the middle
of writing that. I'm just not a great typist ;)

Linus

2023-10-11 20:01:18

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 11, 2023 at 9:52 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 11 Oct 2023 at 11:42, Uros Bizjak <[email protected]> wrote:
> >
> > The attached patch was tested on a target with fsgsbase CPUID and
> > without it. It works!
>
> .. I should clearly read all my emails before answering some of them.
>
> Yes, that patch looks good to me, and I'm happy to hear that you
> actually tested it unlike my "maybe something like this".
>
> > The patch improves amd_pmu_enable_virt() in the same way as reported
> > in the original patch submission and also reduces the number of percpu
> > offset reads (either from this_cpu_off or with rdgsbase) from 1663 to
> > 1571.
>
> Dio y ou have any actka performance numbers? The patch looks good to
> me, and I *think* rdgsbase ends up being faster in practice due to
> avoiding a memory access, but that's very much a gut feel.

Unfortunately, I don't have any perf numbers, only those from Agner's
instruction tables. The memory access performance has so many
parameters that gut feeling is the only thing besides real
case-by-case measurements. The rule of thumb in the compiler world is
also that memory access should be avoided.

Uros.

>
> > The only drawback is a larger binary size:
> >
> > text data bss dec hex filename
> > 25546594 4387686 808452 30742732 1d518cc vmlinux-new.o
> > 25515256 4387814 808452 30711522 1d49ee2 vmlinux-old.o
> >
> > that increases by 31k (0.123%), probably due to 1578 rdgsbase alternatives.
>
> I'm actually surprised that it increases the text size. The 'rdgsbase'
> instruction should be smaller than a 'mov %gs', so I would have
> expected the *data* size to increase due to the alternatives tables,
> but not the text size.
>
> [ Looks around ]
>
> Oh. It's because we put the altinstructions into the text section.
> That's kind of silly, but whatever.
>
> So I think that increase in text-size is not "real" - yes, it
> increases our binary size because we obviously have two instructions,
> but the actual *executable* part likely stays the same, and it's just
> that we grow the altinstruction metadata.
>
> Linus

2023-10-11 21:33:23

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 11, 2023 at 9:37 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 11 Oct 2023 at 00:42, Nadav Amit <[email protected]> wrote:
> >
> > You are correct. Having said that, for “current" we may be able to do something
> > better, as regardless to preemption “current" remains the same, and
> > this_cpu_read_stable() does miss some opportunities to avoid reloading the
> > value from memory.
>
> It would be lovely to generate even better code, but that
> this_cpu_read_stable() thing is the best we've come up with. It
> intentionally has *no* memory inputs or anything else that might make
> gcc think "I need to re-do this".

The attached patch makes this_cpu_read_stable a bit better by using
rip-relative addressing. It brings an immediate reduction of the text
section by 4kB and also makes the kernel a bit more PIE friendly.

> For example, instead of using "m" as a memory input, it very
> intentionally uses "p", to make it clear that that it just uses the
> _pointer_, not the memory location itself.
>
> That's obviously a lie - it actually does access memory - but it's a
> lie exactly because of the reason you mention: even when the memory
> location changes due to preemption (or explicit scheduling), it always
> changes back to the the value we care about.
>
> So gcc _should_ be able to CSE it in all situations, but it's entirely
> possible that gcc then decides to re-generate the value for whatever
> reason. It's a cheap op, so it's ok to regen, of course, but the
> intent is basically to let the compiler re-use the value as much as
> possible.
>
> But it *is* probably better to regenerate the value than it would be
> to spill and re-load it, and from the cases I've seen, this all tends
> to work fairly well.

Reading the above, it looks to me that we don't want to play games
with "const aliased" versions of current_task [1], as proposed by
Nadav in his patch series. The current version of
this_cpu_read_stable() (plus the attached trivial patch) is as good as
it can get.

[1] https://lore.kernel.org/lkml/[email protected]/

Uros.


Attachments:
p.diff.txt (765.00 B)

2023-10-11 21:56:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 11 Oct 2023 at 14:33, Uros Bizjak <[email protected]> wrote:
>
> Reading the above, it looks to me that we don't want to play games
> with "const aliased" versions of current_task [1], as proposed by
> Nadav in his patch series.

Well, maybe I'd like it if I saw what the effect of it was, but that
patch mentions "sync_mm_rss()" which doesn't actually exist
(SPLIT_RSS_COUNTING is never defined, the split version is gone and
hasn't existed since commit f1a7941243c1 "mm: convert mm's rss stats
into percpu_counter")

I'm not sure why gcc used to get code generation wrong there, and I
don't see what could be improved in the current implementation of
'current_task', but I guess there could be things that could be done
to make gcc more likely to CSE the cases..

IOW, I don't understand why Nadav's patch would improve gcc code gen.
It presumably depends on some gcc internal issue (ie "gcc just happens
to be better at optimization X").

Which is obviously a valid thing in general.

Linus

2023-10-11 22:38:01

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()


* Linus Torvalds <[email protected]> wrote:

> > The only drawback is a larger binary size:
> >
> > text data bss dec hex filename
> > 25546594 4387686 808452 30742732 1d518cc vmlinux-new.o
> > 25515256 4387814 808452 30711522 1d49ee2 vmlinux-old.o
> >
> > that increases by 31k (0.123%), probably due to 1578 rdgsbase alternatives.
>
> I'm actually surprised that it increases the text size. The 'rdgsbase'
> instruction should be smaller than a 'mov %gs', so I would have
> expected the *data* size to increase due to the alternatives tables,
> but not the text size.
>
> [ Looks around ]
>
> Oh. It's because we put the altinstructions into the text section.
> That's kind of silly, but whatever.

Yeah, we should probably move .altinstructions from init-text to .init.data
or so? The text area contains a bunch of other sections too that don't get
executed directly ... and in fact has some non-code data structures too,
such as ... ".apicdrivers". :-/

I suspect people put all that into .text because it was the easiest place
to modify in the x86 linker script, and linker scripts are arguably scary.

Will check all this.

Thanks,

Ingo

2023-10-11 23:16:10

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On 10/11/23 15:37, Ingo Molnar wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
>>> The only drawback is a larger binary size:
>>>
>>> text data bss dec hex filename
>>> 25546594 4387686 808452 30742732 1d518cc vmlinux-new.o
>>> 25515256 4387814 808452 30711522 1d49ee2 vmlinux-old.o
>>>
>>> that increases by 31k (0.123%), probably due to 1578 rdgsbase alternatives.
>>
>> I'm actually surprised that it increases the text size. The 'rdgsbase'
>> instruction should be smaller than a 'mov %gs', so I would have
>> expected the *data* size to increase due to the alternatives tables,
>> but not the text size.
>>
>> [ Looks around ]
>>
>> Oh. It's because we put the altinstructions into the text section.
>> That's kind of silly, but whatever.
>
> Yeah, we should probably move .altinstructions from init-text to .init.data
> or so? Contains a bunch of other sections too that don't get executed
> directly ... and in fact has some non-code data structures too, such as ...
> ".apicdrivers". :-/
>
> I suspect people put all that into .text because it was the easiest place
> to modify in the x86 linker script, and linker scripts are arguably scary.
>

Well, it's more than that; "size" considers all non-writable sections to
be "text".

-hpa

2023-10-12 01:35:22

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 11, 2023 at 04:15:15PM -0700, H. Peter Anvin wrote:
> On 10/11/23 15:37, Ingo Molnar wrote:
> >
> > * Linus Torvalds <[email protected]> wrote:
> >
> > > > The only drawback is a larger binary size:
> > > >
> > > > text data bss dec hex filename
> > > > 25546594 4387686 808452 30742732 1d518cc vmlinux-new.o
> > > > 25515256 4387814 808452 30711522 1d49ee2 vmlinux-old.o
> > > >
> > > > that increases by 31k (0.123%), probably due to 1578 rdgsbase alternatives.
> > >
> > > I'm actually surprised that it increases the text size. The 'rdgsbase'
> > > instruction should be smaller than a 'mov %gs', so I would have
> > > expected the *data* size to increase due to the alternatives tables,
> > > but not the text size.
> > >
> > > [ Looks around ]
> > >
> > > Oh. It's because we put the altinstructions into the text section.
> > > That's kind of silly, but whatever.
> >
> > Yeah, we should probably move .altinstructions from init-text to .init.data
> > or so? Contains a bunch of other sections too that don't get executed
> > directly ... and in fact has some non-code data structures too, such as ...
> > ".apicdrivers". :-/
> >
> > I suspect people put all that into .text because it was the easiest place
> > to modify in the x86 linker script, and linker scripts are arguably scary.
> >
>
> Well, it's more than that; "size" considers all non-writable sections to be
> "text".

Indeed, I added a printf to "size", it shows that all the following
sections are "text":

.text
.pci_fixup
.tracedata
__ksymtab
__ksymtab_gpl
__ksymtab_strings
__init_rodata
__param
__ex_table
.notes
.orc_header
.orc_unwind_ip
.orc_unwind
.init.text
.altinstr_aux
.x86_cpu_dev.init
.parainstructions
.retpoline_sites
.return_sites
.call_sites
.altinstructions
.altinstr_replacement
.exit.text
.smp_locks

I can't fathom why it doesn't just filter based on the EXECINSTR section
flag.

"size" is probably worse than useless, as many of these sections can
change size rather arbitrarily, especially .orc_* and .*_sites.

I can't help but wonder how many hasty optimizations have been made over
the years based on the sketchy output of this tool.

It should be trivial to replace the use of "size" with our own
"text_size" script which does what we want, e.g., filter on EXECINSTR.

Here are the current EXECINSTR sections:

~/git/binutils-gdb/binutils $ readelf -WS /tmp/vmlinux |grep X
[ 1] .text PROGBITS ffffffff81000000 200000 1200000 00 AX 0 0 4096
[21] .init.text PROGBITS ffffffff833b7000 27b7000 091b50 00 AX 0 0 16
[22] .altinstr_aux PROGBITS ffffffff83448b50 2848b50 00176a 00 AX 0 0 1
[30] .altinstr_replacement PROGBITS ffffffff8372661a 2b2661a 0028b9 00 AX 0 0 1
[32] .exit.text PROGBITS ffffffff83728f10 2b28f10 0030c7 00 AX 0 0 16

As Ingo mentioned, we could make .altinstr_replacement non-executable.
That confuses objtool, but I think we could remedy that pretty easily.

Though, another problem is that .text has a crazy amount of padding
which makes it always the same size, due to the SRSO alias mitigation
alignment linker magic. We should fix that somehow.

--
Josh

2023-10-12 06:19:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()


* Josh Poimboeuf <[email protected]> wrote:

> Though, another problem is that .text has a crazy amount of padding
> which makes it always the same size, due to the SRSO alias mitigation
> alignment linker magic. We should fix that somehow.

We could emit a non-aligned end-of-text symbol (we might have it already),
and have a script or small .c program in scripts/ or tools/ that looks
at vmlinux and displays a user-friendly and accurate list of text and
data sizes in the kernel?

And since objtool is technically an 'object files tool', and it already
looks at sections & symbols, it could also grow a:

objtool size <objfile>

command that does the sane thing ... I'd definitely start using that, instead of 'size'.

/me runs :-)

Thanks,

Ingo

2023-10-12 15:19:22

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()


> On Oct 12, 2023, at 12:54 AM, Linus Torvalds <[email protected]> wrote:
>
>
> On Wed, 11 Oct 2023 at 14:33, Uros Bizjak <[email protected]> wrote:
>>
>> Reading the above, it looks to me that we don't want to play games
>> with "const aliased" versions of current_task [1], as proposed by
>> Nadav in his patch series.
>
> Well, maybe I'd like it if I saw what the effect of it was, but that
> patch mentions "sync_mm_rss()" which doesn't actually exist
> (SPLIT_RSS_COUNTING is never defined, the split version is gone and
> hasn't existed since commit f1a7941243c1 "mm: convert mm's rss stats
> into percpu_counter")

So I added a new version of the current aliasing (well, actually pcpu_hot
in the new version) on top of Uros’s patches, and the effect can be seen
in many functions. I don’t want to bother with many examples so here is
a common and simple one:

Currently syscall_exit_work() starts with:

0xffffffff8111e120 <+0>: push %rbp
0xffffffff8111e121 <+1>: mov %rdi,%rbp
0xffffffff8111e124 <+4>: push %rbx
0xffffffff8111e125 <+5>: mov %rsi,%rbx
0xffffffff8111e128 <+8>: and $0x20,%esi
0xffffffff8111e12b <+11>: je 0xffffffff8111e143 <syscall_exit_work+35>
0xffffffff8111e12d <+13>: mov %gs:0x2ac80,%rax
0xffffffff8111e136 <+22>: cmpb $0x0,0x800(%rax)
0xffffffff8111e13d <+29>: jne 0xffffffff8111e22a <syscall_exit_work+266>
0xffffffff8111e143 <+35>: mov %gs:0x2ac80,%rax
0xffffffff8111e14c <+44>: cmpq $0x0,0x7c8(%rax)

Using the const-alias changes the beginning of syscall_exit_work to:

0xffffffff8111cb80 <+0>: push %r12
0xffffffff8111cb82 <+2>: mov %gs:0x7ef0e0f6(%rip),%r12 # 0x2ac80 <pcpu_hot>
0xffffffff8111cb8a <+10>: push %rbp
0xffffffff8111cb8b <+11>: mov %rdi,%rbp
0xffffffff8111cb8e <+14>: push %rbx
0xffffffff8111cb8f <+15>: mov %rsi,%rbx
0xffffffff8111cb92 <+18>: and $0x20,%esi
0xffffffff8111cb95 <+21>: je 0xffffffff8111cba6 <syscall_exit_work+38>
0xffffffff8111cb97 <+23>: cmpb $0x0,0x800(%r12)
0xffffffff8111cba0 <+32>: jne 0xffffffff8111cc7a <syscall_exit_work+250>
0xffffffff8111cba6 <+38>: cmpq $0x0,0x7c8(%r12)

So we see both that RIP-relative addressing is being used (hence the
instruction is one byte shorter) and that the reload goes away.

Now, I am not a compiler expert as far as the rationale goes, but googling
around I can see Nick explaining it [1] - if you use "p" you read memory.
BTW: it is related to a discussion you had [2], in which you encountered an
issue I also encountered before [3]. My bad for pushing it in.

Anyhow, I created similar code on godbolt ( https://godbolt.org/z/dPqKKzPs4 )
to show this behavior - how compiler barriers cause a reload. The behavior
shows up on various versions of both GCC and clang.
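
(A reduced sketch of that effect, assuming a compiler with named address
space (__seg_gs) support; the variable names here are made up:)

/* Sketch: a const-qualified __seg_gs object may be assumed unmodified, so
 * the compiler can keep its value in a register across a compiler barrier,
 * while the plain __seg_gs object has to be reloaded after the barrier. */
extern const __seg_gs unsigned long cur_task;
extern __seg_gs unsigned long some_counter;

unsigned long demo(void)
{
        unsigned long a = cur_task + some_counter;

        asm volatile("" ::: "memory");          /* barrier() */
        return a + cur_task + some_counter;     /* some_counter is reloaded */
}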

The idea behind the patch is that it communicates - at compilation-unit
granularity - that current is fixed. There is an open question of whether
it works with LTO, which I have never checked.


[1] https://reviews.llvm.org/D145416
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lore.kernel.org/all/[email protected]/

--

Here's the updated patch - but I didn't really boot a machine with it, so
new issues might have crept in since my last patch set:

-- >8 --

Date: Thu, 12 Oct 2023 06:02:03 -0700
Subject: [PATCH] Const current

---
arch/x86/include/asm/current.h | 17 ++++++++++++++++-
arch/x86/kernel/cpu/common.c | 4 ++++
include/linux/compiler.h | 2 +-
3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/current.h b/arch/x86/include/asm/current.h
index a1168e7b69e5..d05fbb6a8bd7 100644
--- a/arch/x86/include/asm/current.h
+++ b/arch/x86/include/asm/current.h
@@ -36,9 +36,24 @@ static_assert(sizeof(struct pcpu_hot) == 64);
 
 DECLARE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot);
 
+/*
+ * Hold a constant alias for current_task, which would allow to avoid caching of
+ * current task.
+ *
+ * We must mark const_current_task with the segment qualifiers, as otherwise gcc
+ * would do redundant reads of const_current_task.
+ */
+DECLARE_PER_CPU(struct pcpu_hot const __percpu_seg_override, const_pcpu_hot);
+
 static __always_inline struct task_struct *get_current(void)
 {
-	return this_cpu_read_stable(pcpu_hot.current_task);
+
+	/*
+	 * GCC is missing functionality of removing segment qualifiers, which
+	 * messes with per-cpu infrastructure that holds local copies. Use
+	 * __raw_cpu_read to avoid holding any copy.
+	 */
+	return __raw_cpu_read(, const_pcpu_hot.current_task);
 }
 
 #define current get_current()
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 382d4e6b848d..94590af11388 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2052,6 +2052,10 @@ DEFINE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot) = {
 };
 EXPORT_PER_CPU_SYMBOL(pcpu_hot);
 
+DECLARE_PER_CPU_ALIGNED(struct pcpu_hot const __percpu_seg_override, const_pcpu_hot)
+	__attribute__((alias("pcpu_hot")));
+EXPORT_PER_CPU_SYMBOL(const_pcpu_hot);
+
 #ifdef CONFIG_X86_64
 DEFINE_PER_CPU_FIRST(struct fixed_percpu_data,
 		     fixed_percpu_data) __aligned(PAGE_SIZE) __visible;
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index d7779a18b24f..e7059292085e 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -212,7 +212,7 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
  */
 #define ___ADDRESSABLE(sym, __attrs)					\
 	static void * __used __attrs					\
-	__UNIQUE_ID(__PASTE(__addressable_,sym)) = (void *)&sym;
+	__UNIQUE_ID(__PASTE(__addressable_,sym)) = (void *)(uintptr_t)&sym;
 #define __ADDRESSABLE(sym)						\
 	___ADDRESSABLE(sym, __section(".discard.addressable"))
 
--
2.25.1

2023-10-12 16:08:28

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 12, 2023 at 08:19:14AM +0200, Ingo Molnar wrote:
>
> * Josh Poimboeuf <[email protected]> wrote:
>
> > Though, another problem is that .text has a crazy amount of padding
> > which makes it always the same size, due to the SRSO alias mitigation
> > alignment linker magic. We should fix that somehow.
>
> We could emit a non-aligned end-of-text symbol (we might have it already),
> and have a script or small .c program in scripts/ or tools/ that looks
> at vmlinux and displays a user-friendly and accurate list of text and
> data sizes in the kernel?
>
> And since objtool is technically an 'object files tool', and it already
> looks at sections & symbols, it could also grow a:
>
> objtool size <objfile>
>
> command that does the sane thing ... I'd definitely start using that, instead of 'size'.
>
> /me runs :-)

Yeah, that's actually not a bad idea.

I had been thinking a "simple" script would be fine, but I'm realizing
the scope of this thing could grow over time. In which case a script is
less than ideal. And objtool already has the ability to do this pretty
easily.

--
Josh

2023-10-12 16:33:55

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 12, 2023 at 5:19 PM Nadav Amit <[email protected]> wrote:
>
>
> > On Oct 12, 2023, at 12:54 AM, Linus Torvalds <[email protected]> wrote:
> >
> >
> > On Wed, 11 Oct 2023 at 14:33, Uros Bizjak <[email protected]> wrote:
> >>
> >> Reading the above, it looks to me that we don't want to play games
> >> with "const aliased" versions of current_task [1], as proposed by
> >> Nadav in his patch series.
> >
> > Well, maybe I'd like it if I saw what the effect of it was, but that
> > patch mentions "sync_mm_rss()" which doesn't actually exist
> > (SPLIT_RSS_COUNTING is never defined, the split version is gone and
> > hasn't existed since commit f1a7941243c1 "mm: convert mm's rss stats
> > into percpu_counter")
>
> So I added a new version of the current aliasing (well, actually pcpu_hot
> in the new version) on top of Uros’s patches, and the effect can be seen
> in many functions. I don’t want to bother with many examples so here is
> a common and simple one:
>
> Currently syscall_exit_work() that starts with:
>
> 0xffffffff8111e120 <+0>: push %rbp
> 0xffffffff8111e121 <+1>: mov %rdi,%rbp
> 0xffffffff8111e124 <+4>: push %rbx
> 0xffffffff8111e125 <+5>: mov %rsi,%rbx
> 0xffffffff8111e128 <+8>: and $0x20,%esi
> 0xffffffff8111e12b <+11>: je 0xffffffff8111e143 <syscall_exit_work+35>
> 0xffffffff8111e12d <+13>: mov %gs:0x2ac80,%rax
> 0xffffffff8111e136 <+22>: cmpb $0x0,0x800(%rax)
> 0xffffffff8111e13d <+29>: jne 0xffffffff8111e22a <syscall_exit_work+266>
> 0xffffffff8111e143 <+35>: mov %gs:0x2ac80,%rax
> 0xffffffff8111e14c <+44>: cmpq $0x0,0x7c8(%rax)
>
> Using the const-alias changes the beginning of syscall_exit_work to:
>
> 0xffffffff8111cb80 <+0>: push %r12
> 0xffffffff8111cb82 <+2>: mov %gs:0x7ef0e0f6(%rip),%r12 # 0x2ac80 <pcpu_hot>
> 0xffffffff8111cb8a <+10>: push %rbp
> 0xffffffff8111cb8b <+11>: mov %rdi,%rbp
> 0xffffffff8111cb8e <+14>: push %rbx
> 0xffffffff8111cb8f <+15>: mov %rsi,%rbx
> 0xffffffff8111cb92 <+18>: and $0x20,%esi
> 0xffffffff8111cb95 <+21>: je 0xffffffff8111cba6 <syscall_exit_work+38>
> 0xffffffff8111cb97 <+23>: cmpb $0x0,0x800(%r12)
> 0xffffffff8111cba0 <+32>: jne 0xffffffff8111cc7a <syscall_exit_work+250>
> 0xffffffff8111cba6 <+38>: cmpq $0x0,0x7c8(%r12)
>
> So we both see RIP-relative addressing is being used (hence the instruction is
> one byte shorter) and the reload going away.

Just a quick remark here:

For some reason the existing percpu_stable_op asm uses the %P operand
modifier. This will drop all syntax-specific prefixes and issue the
bare constant. It will also remove the (%rip) suffix. What we want
here is the generic %a modifier (see 6.47.2.8 Generic Operand Modifiers
[1]), which substitutes a memory reference, with the actual operand
treated as the address. In combination with the "p" constraint it will
DTRT and emit the symbol with the (%rip) suffix when available, also
when -fpie is in effect.

[1] https://gcc.gnu.org/onlinedocs/gcc-13.2.0/gcc.pdf

Uros.

2023-10-12 16:55:37

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 12, 2023 at 6:33 PM Uros Bizjak <[email protected]> wrote:
>
> On Thu, Oct 12, 2023 at 5:19 PM Nadav Amit <[email protected]> wrote:
> >
> >
> > > On Oct 12, 2023, at 12:54 AM, Linus Torvalds <[email protected]> wrote:
> > >
> > >
> > > On Wed, 11 Oct 2023 at 14:33, Uros Bizjak <[email protected]> wrote:
> > >>
> > >> Reading the above, it looks to me that we don't want to play games
> > >> with "const aliased" versions of current_task [1], as proposed by
> > >> Nadav in his patch series.
> > >
> > > Well, maybe I'd like it if I saw what the effect of it was, but that
> > > patch mentions "sync_mm_rss()" which doesn't actually exist
> > > (SPLIT_RSS_COUNTING is never defined, the split version is gone and
> > > hasn't existed since commit f1a7941243c1 "mm: convert mm's rss stats
> > > into percpu_counter")
> >
> > So I added a new version of the current aliasing (well, actually pcpu_hot
> > in the new version) on top of Uros’s patches, and the effect can be seen
> > in many functions. I don’t want to bother with many examples so here is
> > a common and simple one:
> >
> > Currently syscall_exit_work() that starts with:
> >
> > 0xffffffff8111e120 <+0>: push %rbp
> > 0xffffffff8111e121 <+1>: mov %rdi,%rbp
> > 0xffffffff8111e124 <+4>: push %rbx
> > 0xffffffff8111e125 <+5>: mov %rsi,%rbx
> > 0xffffffff8111e128 <+8>: and $0x20,%esi
> > 0xffffffff8111e12b <+11>: je 0xffffffff8111e143 <syscall_exit_work+35>
> > 0xffffffff8111e12d <+13>: mov %gs:0x2ac80,%rax
> > 0xffffffff8111e136 <+22>: cmpb $0x0,0x800(%rax)
> > 0xffffffff8111e13d <+29>: jne 0xffffffff8111e22a <syscall_exit_work+266>
> > 0xffffffff8111e143 <+35>: mov %gs:0x2ac80,%rax
> > 0xffffffff8111e14c <+44>: cmpq $0x0,0x7c8(%rax)
> >
> > Using the const-alias changes the beginning of syscall_exit_work to:
> >
> > 0xffffffff8111cb80 <+0>: push %r12
> > 0xffffffff8111cb82 <+2>: mov %gs:0x7ef0e0f6(%rip),%r12 # 0x2ac80 <pcpu_hot>
> > 0xffffffff8111cb8a <+10>: push %rbp
> > 0xffffffff8111cb8b <+11>: mov %rdi,%rbp
> > 0xffffffff8111cb8e <+14>: push %rbx
> > 0xffffffff8111cb8f <+15>: mov %rsi,%rbx
> > 0xffffffff8111cb92 <+18>: and $0x20,%esi
> > 0xffffffff8111cb95 <+21>: je 0xffffffff8111cba6 <syscall_exit_work+38>
> > 0xffffffff8111cb97 <+23>: cmpb $0x0,0x800(%r12)
> > 0xffffffff8111cba0 <+32>: jne 0xffffffff8111cc7a <syscall_exit_work+250>
> > 0xffffffff8111cba6 <+38>: cmpq $0x0,0x7c8(%r12)
> >
> > So we both see RIP-relative addressing is being used (hence the instruction is
> > one byte shorter) and the reload going away.
>
> Just a quick remark here:
>
> For some reason existing percpu_stable_op asm uses %P operand
> modifier. This will drop all syntax-specific prefixes and issue the
> bare constant. It will also remove the (%rip) suffix. What we want
> here is a generic %a modifier (See 6.47.2.8 Generic Operand Modifiers
> [1]) that will substitute a memory reference, with the actual operand
> treated as the address. In combination with "p" constraint will DTRT
> and will emit symbol with the (%rip) suffix when available, also when
> -fpie is in effect.

An example:

--cut here--
int m;

int foo (void)
{
asm ("# %0 %P0 %a0" :: "p" (&m));
}
--cut here--

gcc -O2 -S:

# $m m m(%rip)

gcc -O2 -fpie -S:

# $m m m(%rip)

gcc -O2 -m32 -S:

# $m m m

Uros.

2023-10-12 16:57:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 12 Oct 2023 at 09:33, Uros Bizjak <[email protected]> wrote:
>
> For some reason existing percpu_stable_op asm uses %P operand
> modifier. This will drop all syntax-specific prefixes and issue the
> bare constant. It will also remove the (%rip) suffix. What we want
> here is a generic %a modifier (See 6.47.2.8 Generic Operand Modifiers
> [1]) that will substitute a memory reference, with the actual operand
> treated as the address. In combination with "p" constraint will DTRT
> and will emit symbol with the (%rip) suffix when available, also when
> -fpie is in effect.

Well, I have to admit that I think the main reason we used "p" as a
constraint was simply that we knew about it, and I don't think I've
ever even realized "a" existed.

In fact, we historically didn't use a lot of operand modifiers, and
the only common ones (at least for x86) tend to be the "register size"
modifiers (b/w/q).

I just did

git grep 'asm.*[^%]%[a-z][0-9]' arch/x86

and while that will only catch one-liner inline asm cases, all it
finds is indeed just the size ones.

A slightly smarter grep finds a couple of uses of '%c' for bare constants.

So I think the "we used P for percpu_stable_op" is really mostly a "we
are not actually very familiar with the operand modifiers". We use
the constraints fairly wildly, but the operand modifiers are a
relative rarity.

I suspect - but it's much too long ago - that some gcc person just
told us that "solve this using 'P'" and we never went any further.

So changing it to use 'a' sounds like the right thing to do, and we
can only plead ignorance, not wilful stupidity.

Linus

2023-10-12 17:10:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 12 Oct 2023 at 09:55, Uros Bizjak <[email protected]> wrote:
>
> An example:

Oh, I'm convinced.

The fix seems to be a simple one-liner, ie just

- asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]") \
+ asm(__pcpu_op2_##size(op, __percpu_arg(a[var]), "%[val]") \

and it turns out that we have other places where I think we could use that '%a',

For example, we have things like this:

asm ("lea sme_cmdline_arg(%%rip), %0"
: "=r" (cmdline_arg)
: "p" (sme_cmdline_arg));

and I think the only reason we do that ridiculous asm is that the code
in question really does want that (%rip) encoding. It sounds like this
could just do

asm ("lea %a1, %0"
: "=r" (cmdline_arg)
: "p" (sme_cmdline_arg));

instead. Once again, I claim ignorance of the operand modifiers as the
reason for these kinds of things.

But coming back to the stable op thing, I do wonder if there is some
way we could avoid the unnecessary reload.

I don't hate Nadav's patch, so that part is fine, but I'd like to
understand what it is that makes gcc think it needs to reload. We have
other cases (like the ALTERNATIVE() uses) where we *have* to use
inline asm, so it would be good to know...

Is it just that "p" (in the constraint, not "P" in the modifier) ends
up always being seen as a memory access, even when we only use the
address?

That part has never really been something we've been entirely clear
on. We *are* passing in just the address, so the hope in *that* place
is that it's only an address dependency, not a memory one.

(Of course, we use 'p' constraints in other places, and may
expect/assume that those have memory dependency. Again - this has
never been entirely clear)

Linus

2023-10-12 17:17:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 12 Oct 2023 at 08:19, Nadav Amit <[email protected]> wrote:
>
> +/*
> + * Hold a constant alias for current_task, which would allow to avoid caching of
> + * current task.
> + *
> + * We must mark const_current_task with the segment qualifiers, as otherwise gcc
> + * would do redundant reads of const_current_task.
> + */
> +DECLARE_PER_CPU(struct pcpu_hot const __percpu_seg_override, const_pcpu_hot);

Hmm. The only things I'm not super-happy about with your patch are

(a) it looks like this depends on the alias analysis knowing that the
__seg_gs isn't affected by normal memory ops. That implies that this
will not work well with compiler versions that don't do that?

(b) This declaration doesn't match the other one. So now there are
two *different* declarations for const_pcpu_hot, which I really don't
like.

That second one would seem to be trivial to just fix (or maybe not,
and you do it that way for some horrible reason).

The first one sounds bad to me - basically making the *reason* for
this patch go away - but maybe the compilers that don't support
address spaces are so rare that we can ignore it.

Linus

2023-10-12 17:47:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 12 Oct 2023 at 10:10, Linus Torvalds
<[email protected]> wrote:
>
> The fix seems to be a simple one-liner, ie just
>
> - asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]") \
> + asm(__pcpu_op2_##size(op, __percpu_arg(a[var]), "%[val]") \

Nope. That doesn't work at all.

It turns out that we're not the only ones that didn't know about the
'a' modifier.

clang has also never heard of it in this context, and the above
one-liner results in an endless sea of errors, with

error: invalid operand in inline asm: 'movq %gs:${1:a}, $0'

Looking around, I think it's X86AsmPrinter::PrintAsmOperand() that is
supposed to handle these things, and while it does have some handling
for 'a', the comment around it says

case 'a': // This is an address. Currently only 'i' and 'r' are expected.

and I think our use ends up just confusing the heck out of clang. Of
course, clang also does this:

case 'P': // This is the operand of a call, treat specially.
PrintPCRelImm(MI, OpNo, O);
return false;

so clang *already* generates those 'current' accesses as PC-relative, and I see

movq %gs:pcpu_hot(%rip), %r13

in the generated code.

End result: clang actually generates what we want just using 'P', and
the whole "P vs a" is only a gcc thing.

Why *does* gcc do that silly thing of dropping '(%rip)' from the address, btw?

Linus

2023-10-12 17:52:44

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 12, 2023 at 7:10 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, 12 Oct 2023 at 09:55, Uros Bizjak <[email protected]> wrote:
> >
> > An example:
>
> Oh, I'm convinced.
>
> The fix seems to be a simple one-liner, ie just
>
> - asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]") \
> + asm(__pcpu_op2_##size(op, __percpu_arg(a[var]), "%[val]") \

The effect of the change:

25542442 4387686 808452 30738580 1d50894 vmlinux-new.o
25546484 4387686 808452 30742622 1d5185e vmlinux-old.o

> and it turns out that we have other places where I think we could use that '%a',
>
> For example, we have things like this:
>
> asm ("lea sme_cmdline_arg(%%rip), %0"
> : "=r" (cmdline_arg)
> : "p" (sme_cmdline_arg));
>
> and I think the only reason we do that ridiculous asm is that the code
> in question really does want that (%rip) encoding. It sounds like this
> could just do
>
> asm ("lea %a1, %0"
> : "=r" (cmdline_arg)
> : "p" (sme_cmdline_arg));
>
> instead. Once again, I claim ignorance of the operand modifiers as the
> reason for these kinds of things.
>
> But coming back to the stable op thing, I do wonder if there is some
> way we could avoid the unnecessary reload.
>
> I don't hate Nadav's patch, so that part is fine, but I'd like to
> understand what it is that makes gcc think it needs to reload. We have
> other cases (like the ALTERNATIVE() uses) where we *have* to use
> inline asm, so it would be good to know...
>
> Is it just that "p" (in the constraint, not "P" in the modifier) ends
> up always being seen as a memory access, even when we only use the
> address?
>
> That part has never really been something we've been entirely clear
> on. We *are* passing in just the address, so the hope in *that* place
> is that it's only an address dependency, not a memory one.

Let's see the difference of:

--cut here--
int m;

void foo (void)
{
  asm ("# %a0" :: "p" (&m));
}

void bar (void)
{
  asm ("# %0" :: "m" (m));
}
--cut here--

The internal dump shows:

(insn:TI 5 2 15 2 (parallel [
(asm_operands/v ("# %a0") ("") 0 [
(symbol_ref:DI ("m") [flags 0x2] <var_decl
0x7f3175011bd0 m>)
]
[
(asm_input:DI ("p") rip.c:5)
]
[] rip.c:5)
(clobber (reg:CC 17 flags))
]) "rip.c":5:3 -1
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)))

vs:

(insn:TI 5 2 13 2 (parallel [
(asm_operands/v ("# %0") ("") 0 [
(mem/c:SI (symbol_ref:DI ("m") [flags 0x2]
<var_decl 0x7f3175011bd0 m>) [1 m+0 S4 A32])
]
[
(asm_input:SI ("m") rip.c:10)
]
[] rip.c:10)
(clobber (reg:CC 17 flags))
]) "rip.c":10:3 -1
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)))

The first argument is internally regarded as "constant":

-- Macro: CONSTANT_P (X)
'CONSTANT_P', which is defined by target-independent code, accepts
integer-valued expressions whose values are not explicitly known,
such as 'symbol_ref', 'label_ref', and 'high' expressions and
'const' arithmetic expressions, in addition to 'const_int' and
'const_double' expressions.

So, it should not have any dependency.

Perhaps a testcase should be created and posted to gcc-bugs for
further analysis.
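
Something along these lines could serve as a minimal testcase (only a
sketch with made-up names, not yet verified against current gcc),
checking whether a "p" input that carries nothing but an address still
forces the asm to be redone across a compiler barrier:

--cut here--
int m;

static inline int *addr_of_m (void)
{
  int *p;

  /* Only the address of m is passed in via "p"; ideally this should
     count as an address dependency, not a memory read.  */
  asm ("lea %a1, %0" : "=r" (p) : "p" (&m));
  return p;
}

int foo (void)
{
  int a = *addr_of_m ();
  asm volatile ("" ::: "memory");   /* compiler barrier */
  int b = *addr_of_m ();            /* is the lea emitted again here? */
  return a + b;
}
--cut here--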

Uros.

2023-10-12 18:01:03

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()


* Josh Poimboeuf <[email protected]> wrote:

> On Thu, Oct 12, 2023 at 08:19:14AM +0200, Ingo Molnar wrote:
> >
> > * Josh Poimboeuf <[email protected]> wrote:
> >
> > > Though, another problem is that .text has a crazy amount of padding
> > > which makes it always the same size, due to the SRSO alias mitigation
> > > alignment linker magic. We should fix that somehow.
> >
> > We could emit a non-aligned end-of-text symbol (we might have it already),
> > and have a script or small .c program in scripts/ or tools/ that looks
> > at vmlinux and displays a user-friendly and accurate list of text and
> > data sizes in the kernel?
> >
> > And since objtool is technically an 'object files tool', and it already
> > looks at sections & symbols, it could also grow a:
> >
> > objtool size <objfile>
> >
> > command that does the sane thing ... I'd definitely start using that, instead of 'size'.
> >
> > /me runs :-)
>
> Yeah, that's actually not a bad idea.
>
> I had been thinking a "simple" script would be fine, but I'm realizing
> the scope of this thing could grow over time. In which case a script is
> less than ideal. And objtool already has the ability to do this pretty
> easily.

Yeah, and speed actually matters here: I have scripts that generate object
comparisons between commits, and every second of runtime counts - and a
script would be slower and more fragile for something like allmodconfig
builds or larger distro configs.

BTW., maybe the right objtool subcommand would be 'objtool sections', with
an 'objtool sections size' sub-sub-command. Because I think this discussion
shows that it would be good to have a bit of visibility into the sanity of
our sections setup, with 'objtool sections check' for example doing a
sanity check on whether there's anything extra in the text section that
shouldn't be there? Or so ...

Thanks,

Ingo

2023-10-12 18:02:12

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 12, 2023 at 7:47 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, 12 Oct 2023 at 10:10, Linus Torvalds
> <[email protected]> wrote:
> >
> > The fix seems to be a simple one-liner, ie just
> >
> > - asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]") \
> > + asm(__pcpu_op2_##size(op, __percpu_arg(a[var]), "%[val]") \
>
> Nope. That doesn't work at all.
>
> It turns out that we're not the only ones that didn't know about the
> 'a' modifier.
>
> clang has also never heard of it in this context, and the above
> one-liner results in an endless sea of errors, with
>
> error: invalid operand in inline asm: 'movq %gs:${1:a}, $0'
>
> Looking around, I think it's X86AsmPrinter::PrintAsmOperand() that is
> supposed to handle these things, and while it does have some handling
> for 'a', the comment around it says
>
> case 'a': // This is an address. Currently only 'i' and 'r' are expected.
>
> and I think our use ends up just confusing the heck out of clang. Of
> course, clang also does this:
>
> case 'P': // This is the operand of a call, treat specially.
> PrintPCRelImm(MI, OpNo, O);
> return false;
>
> so clang *already* generates those 'current' accesses as PCrelative, and I see
>
> movq %gs:pcpu_hot(%rip), %r13
>
> in the generated code.
>
> End result: clang actually generates what we want just using 'P', and
> the whole "P vs a" is only a gcc thing.

Ugh, this isn't exactly following Clang's claim that "In general,
Clang is highly compatible with the GCC inline assembly extensions,
allowing the same set of constraints, modifiers and operands as GCC
inline assembly."

[1] https://clang.llvm.org/compatibility.html#inline-asm

> Why *does* gcc do that silly thing of dropping '(%rip)' from the address, btw?

The documentation says:

[p] Print raw symbol name (without syntax-specific prefixes).

[P] If used for a function, print the PLT suffix and generate PIC
code. For example, emit foo@PLT instead of ’foo’ for the function
foo(). If used for a constant, drop all syntax-specific prefixes and
issue the bare constant. See p above.

I'd say that "bare constant" is something without (%rip).

Uros.

2023-10-12 19:34:30

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()



> On Oct 12, 2023, at 8:16 PM, Linus Torvalds <[email protected]> wrote:
>
> On Thu, 12 Oct 2023 at 08:19, Nadav Amit <[email protected]> wrote:
>>
>> +/*
>> + * Hold a constant alias for current_task, which would allow to avoid caching of
>> + * current task.
>> + *
>> + * We must mark const_current_task with the segment qualifiers, as otherwise gcc
>> + * would do redundant reads of const_current_task.
>> + */
>> +DECLARE_PER_CPU(struct pcpu_hot const __percpu_seg_override, const_pcpu_hot);
>
> Hmm. The only things I'm not super-happy about with your patch is
>
> (a) it looks like this depends on the alias analysis knowing that the
> __seg_gs isn't affected by normal memory ops. That implies that this
> will not work well with compiler versions that don't do that?
>
> (b) This declaration doesn't match the other one. So now there are
> two *different* declarations for const_pcpu_hot, which I really don't
> like.
>
> That second one would seem to be trivial to just fix (or maybe not,
> and you do it that way for some horrible reason).

If you refer to the difference between DECLARE_PER_CPU_ALIGNED() and
DECLARE_PER_CPU() - that’s just a silly mistake that I made porting my
old patch (I also put “const” in the wrong place of the declaration, sorry).

>
> The first one sounds bad to me - basically making the *reason* for
> this patch go away - but maybe the compilers that don't support
> address spaces are so rare that we can ignore it.

As far as I understand it has nothing to do with the address spaces, and IIRC
the compiler does not regard gs/fs address spaces as independent from the main
one. That’s the reason a compiler barrier affects regular loads with __seg_gs.

The “trick” that the patch does is to expose a new const_pcpu_hot symbol that has
a “const” qualifier. For compilation units in which the symbol is effectively
constant, we use const_pcpu_hot. The compiler then knows that the value would not
change.

Later, when we actually define the const_pcpu_hot, we tell the compiler using
__attribute__((alias("pcpu_hot"))) that this symbol is actually an alias to pcpu_hot.
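
For illustration, a stripped-down sketch of the mechanism with plain,
made-up globals (leaving out the per-CPU segment details entirely):

--cut here--
struct task_struct;
struct pcpu_hot { struct task_struct *current_task; };

/* the real, writable object (defined in exactly one translation unit) */
struct pcpu_hot pcpu_hot;

/* a const-qualified second name for the same storage; readers that go
   through this name let the compiler assume the value never changes,
   so it can be cached across calls and compiler barriers */
extern const struct pcpu_hot const_pcpu_hot
        __attribute__((alias("pcpu_hot")));
--cut here--

Other compilation units only ever see the two extern declarations and
pick one name or the other depending on whether they modify the field.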

Although it is a bit of a trick that I have never seen elsewhere, I don’t see it
violating GCC specifications (“except for top-level qualifiers the alias target
must have the same type as the alias” [1]), and there is nothing that is specific
to the gs address-space. I still have the concern of its interaction with LTO
though, and perhaps using “-fno-lto” when compiling compilation units that
modify current (e.g., arch/x86/kernel/process_64.o) is necessary.

I hope it makes sense.

[1] https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html


Attachments:
signature.asc (849.00 B)
Message signed with OpenPGP

2023-10-12 19:41:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 12 Oct 2023 at 12:32, Nadav Amit <[email protected]> wrote:
>
> If you refer to the difference between DECLARE_PER_CPU_ALIGNED() and
> DECLARE_PER_CPU() - that’s just a silly mistake that I made porting my
> old patch (I also put “const” in the wrong place of the declaration, sorry).

Yes, the only difference is the ALIGNED and the 'const', but I think
also the alias attribute.

However, I'd be happier if we had just one place that declares them, not two.

Even if the two were identical, it seems wrong to have two
declarations for the same thing.

> The “trick” that the patch does is to expose a new const_pcpu_hot symbol that has
> a “const” qualifier. For compilation units from which the symbol is effectively
> constant, we use const_pcpu_hot. The compiler then knows that the value would not
> change.

Oh, I don't disagree with that part.

I just don't see why the 'asm' version would have any difference. For
that too, the compiler should see that the result of the asm doesn't
change.

So my confusion / worry is not about the const alias. I like that part.

My worry is literally "in other situations we _have_ to use asm(), and
it's not clear why gcc wouldn't do as well for it".

Linus

2023-10-12 21:31:15

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 12, 2023 at 07:59:43PM +0200, Ingo Molnar wrote:
>
> * Josh Poimboeuf <[email protected]> wrote:
>
> > On Thu, Oct 12, 2023 at 08:19:14AM +0200, Ingo Molnar wrote:
> > >
> > > * Josh Poimboeuf <[email protected]> wrote:
> > >
> > > > Though, another problem is that .text has a crazy amount of padding
> > > > which makes it always the same size, due to the SRSO alias mitigation
> > > > alignment linker magic. We should fix that somehow.
> > >
> > > We could emit a non-aligned end-of-text symbol (we might have it already),
> > > and have a script or small .c program in scripts/ or tools/ that looks
> > > at vmlinux and displays a user-friendly and accurate list of text and
> > > data sizes in the kernel?
> > >
> > > And since objtool is technically an 'object files tool', and it already
> > > looks at sections & symbols, it could also grow a:
> > >
> > > objtool size <objfile>
> > >
> > > command that does the sane thing ... I'd definitely start using that, instead of 'size'.
> > >
> > > /me runs :-)
> >
> > Yeah, that's actually not a bad idea.
> >
> > I had been thinking a "simple" script would be fine, but I'm realizing
> > the scope of this thing could grow over time. In which case a script is
> > less than ideal. And objtool already has the ability to do this pretty
> > easily.
>
> Yeah, and speed actually matters here: I have scripts that generate object
> comparisons between commits, and every second of runtime counts - and a
> script would be slower and more fragile for something like allmodconfig
> builds or larger disto configs.

Ah, good to know.

> BTW., maybe the right objtool subcommand would be 'objtool sections', with
> an 'objtool sections size' sub-sub-command. Because I think this discussion
> shows that it would be good to have a bit of visibility into the sanity of
> our sections setup, with 'objtool sections check' for example doing a
> sanity check on whether there's anything extra in the text section that
> shouldn't be there? Or so ...

What would be an example of something "extra"? A sanity check might fit
better alongside the other checks already being done by the main objtool
"subcommand" which gets run by the kernel build.

BTW, I actually removed subcommands a while ago when I overhauled
objtool's interface to make it easier to combine options. That said,
I'm not opposed to re-adding them if we can find a sane way to do so.

Here's the current interface:

Usage: objtool <actions> [<options>] file.o

Actions:
-h, --hacks[=<jump_label,noinstr,skylake>]
patch toolchain bugs/limitations
-i, --ibt validate and annotate IBT
-l, --sls validate straight-line-speculation mitigations
-m, --mcount annotate mcount/fentry calls for ftrace
-n, --noinstr validate noinstr rules
-o, --orc generate ORC metadata
-r, --retpoline validate and annotate retpoline usage
-s, --stackval validate frame pointer rules
-t, --static-call annotate static calls
-u, --uaccess validate uaccess rules for SMAP
--cfi annotate kernel control flow integrity (kCFI) function preambles
--dump[=<orc>] dump metadata
--prefix <n> generate prefix symbols
--rethunk validate and annotate rethunk usage
--unret validate entry unret placement

Options:
-v, --verbose verbose warnings
--backtrace unwind on error
--backup create .orig files before modification
--dry-run don't write modifications
--link object is a linked object
--mnop nop out mcount call sites
--module object is part of a kernel module
--no-unreachable skip 'unreachable instruction' warnings
--sec-address print section addresses in warnings
--stats print statistics


Note how all the actions can be easily combined in a single execution
instance.

If we re-added subcommands, most of the existing functionality would be
part of a single subcommand. It used to be called "check", but it's no
longer a read-only operation so that's misleading. I'll call it "run"
for now.

Right now my preference would be to leave the existing interface as-is,
and then graft optional subcommands on top. If no subcommand is
specified then it would default to the "run" subcommand. It's a little
funky, but it would work well for the common case, where ~99% of the
functionality lives. And it doesn't break existing setups and
backports.

For example:

# current interface (no changes)
objtool --mcount --orc --retpoline --uaccess vmlinux.o

# same, with optional explicit "run" subcommand
objtool run --mcount --orc --retpoline --uaccess vmlinux.o

# new "size" subcommand
objtool size [options] vmlinux.o.before vmlinux.o.after

--
Josh

2023-10-13 09:40:09

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 12, 2023 at 8:01 PM Uros Bizjak <[email protected]> wrote:
>
> On Thu, Oct 12, 2023 at 7:47 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Thu, 12 Oct 2023 at 10:10, Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > The fix seems to be a simple one-liner, ie just
> > >
> > > - asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]") \
> > > + asm(__pcpu_op2_##size(op, __percpu_arg(a[var]), "%[val]") \
> >
> > Nope. That doesn't work at all.
> >
> > It turns out that we're not the only ones that didn't know about the
> > 'a' modifier.
> >
> > clang has also never heard of it in this context, and the above
> > one-liner results in an endless sea of errors, with
> >
> > error: invalid operand in inline asm: 'movq %gs:${1:a}, $0'
> >
> > Looking around, I think it's X86AsmPrinter::PrintAsmOperand() that is
> > supposed to handle these things, and while it does have some handling
> > for 'a', the comment around it says
> >
> > case 'a': // This is an address. Currently only 'i' and 'r' are expected.
> >
> > and I think our use ends up just confusing the heck out of clang. Of
> > course, clang also does this:
> >
> > case 'P': // This is the operand of a call, treat specially.
> > PrintPCRelImm(MI, OpNo, O);
> > return false;
> >
> > so clang *already* generates those 'current' accesses as PCrelative, and I see
> >
> > movq %gs:pcpu_hot(%rip), %r13
> >
> > in the generated code.
> >
> > End result: clang actually generates what we want just using 'P', and
> > the whole "P vs a" is only a gcc thing.
>
> Ugh, this isn't exactly following Clang's claim that "In general,
> Clang is highly compatible with the GCC inline assembly extensions,
> allowing the same set of constraints, modifiers and operands as GCC
> inline assembly."

For added fun I obtained some old clang:

$ clang --version
clang version 11.0.0 (Fedora 11.0.0-3.fc33)

and tried to compile this:

int m;
__seg_gs int n;

void foo (void)
{
  asm ("# %a0 %a1" :: "p" (&m), "p" (&n));
  asm ("# %P0 %P1" :: "p" (&m), "p" (&n));
}

clang-11:

# m n
# m n

clang-11 -fpie:

# m(%rip) n(%rip)
# m n

clang-11 -m32:

# m n
# m n

gcc:

# m(%rip) n(%rip)
# m n

gcc -fpie:

# m(%rip) n(%rip)
# m n

gcc -m32:

# m n
# m n

Please find attached a patch that should bring some order to this
issue. The patch includes two demonstration sites: the generated code
for mem_encrypt_identity.c does not change, while the change in
percpu.h brings the expected 4kB code size reduction.

Uros.


Attachments:
memref.diff.txt (2.55 kB)

2023-10-13 10:52:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()


* Josh Poimboeuf <[email protected]> wrote:

> Right now my preference would be to leave the existing interface as-is,
> and then graft optional subcommands on top. If no subcommand is
> specified then it would default to the "run" subcommand. It's a little
> funky, but it would work well for the common case, where ~99% of the
> functionality lives. And it doesn't break existing setups and
> backports.
>
> For example:
>
> # current interface (no changes)
> objtool --mcount --orc --retpoline --uaccess vmlinux.o
>
> # same, with optional explicit "run" subcommand
> objtool run --mcount --orc --retpoline --uaccess vmlinux.o
>
> # new "size" subcommand
> obtool size [options] vmlinux.o.before vmlinux.o.after

Yeah, sounds good!

Thanks,

Ingo

2023-10-13 11:54:00

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Fri, Oct 13, 2023 at 11:38 AM Uros Bizjak <[email protected]> wrote:
>
> On Thu, Oct 12, 2023 at 8:01 PM Uros Bizjak <[email protected]> wrote:
> >
> > On Thu, Oct 12, 2023 at 7:47 PM Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > On Thu, 12 Oct 2023 at 10:10, Linus Torvalds
> > > <[email protected]> wrote:
> > > >
> > > > The fix seems to be a simple one-liner, ie just
> > > >
> > > > - asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]") \
> > > > + asm(__pcpu_op2_##size(op, __percpu_arg(a[var]), "%[val]") \
> > >
> > > Nope. That doesn't work at all.
> > >
> > > It turns out that we're not the only ones that didn't know about the
> > > 'a' modifier.
> > >
> > > clang has also never heard of it in this context, and the above
> > > one-liner results in an endless sea of errors, with
> > >
> > > error: invalid operand in inline asm: 'movq %gs:${1:a}, $0'
> > >
> > > Looking around, I think it's X86AsmPrinter::PrintAsmOperand() that is
> > > supposed to handle these things, and while it does have some handling
> > > for 'a', the comment around it says
> > >
> > > case 'a': // This is an address. Currently only 'i' and 'r' are expected.
> > >
> > > and I think our use ends up just confusing the heck out of clang. Of
> > > course, clang also does this:
> > >
> > > case 'P': // This is the operand of a call, treat specially.
> > > PrintPCRelImm(MI, OpNo, O);
> > > return false;
> > >
> > > so clang *already* generates those 'current' accesses as PCrelative, and I see
> > >
> > > movq %gs:pcpu_hot(%rip), %r13
> > >
> > > in the generated code.
> > >
> > > End result: clang actually generates what we want just using 'P', and
> > > the whole "P vs a" is only a gcc thing.

Maybe we should go with what Clang expects. %a with "i" constraint is
also what GCC handles, because

‘i’: An immediate integer operand (one with constant value) is
allowed. This includes symbolic constants whose values will be known
only at assembly time or later.

The attached patch covers both cases: the generated code for
mem_encrypt_identity.c does not change, while the change in percpu.h
brings the expected 4kB code size reduction. I think this is the
correct solution that will work for both compilers.
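
For reference, the pattern boils down to something like this minimal
sketch (the earlier testcase adjusted, not the attached patch itself):

--cut here--
int m;

void foo (void)
{
  /* "%a0" asks for operand 0 to be printed as an address; with an "i"
     (symbolic constant) input both gcc and clang accept the modifier,
     where the earlier "%P0" form was gcc-only.  */
  asm ("# %a0" :: "i" (&m));
}
--cut here--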

Uros.


Attachments:
memref-2.diff.txt (1.97 kB)

2023-10-13 16:39:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Fri, 13 Oct 2023 at 04:53, Uros Bizjak <[email protected]> wrote:
>
> Maybe we should go with what Clang expects. %a with "i" constraint is
> also what GCC handles, because
>
> ‘i’: An immediate integer operand (one with constant value) is
> allowed. This includes symbolic constants whose values will be known
> only at assembly time or later.

This looks fine to me, and would seem to be the simplest way to have
both gcc and clang be happy with things.

All these uses seem to be immediate addresses, as any actual dynamic
ones will use a proper "m" constraint (and no operand modifiers).

Linus

2023-10-16 18:53:19

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 12, 2023 at 5:19 PM Nadav Amit <[email protected]> wrote:
>
>
> > On Oct 12, 2023, at 12:54 AM, Linus Torvalds <[email protected]> wrote:
> >
> > On Wed, 11 Oct 2023 at 14:33, Uros Bizjak <[email protected]> wrote:
> >>
> >> Reading the above, it looks to me that we don't want to play games
> >> with "const aliased" versions of current_task [1], as proposed by
> >> Nadav in his patch series.
> >
> > Well, maybe I'd like it if I saw what the effect of it was, but that
> > patch mentions "sync_mm_rss()" which doesn't actually exist
> > (SPLIT_RSS_COUNTING is never defined, the split version is gone and
> > hasn't existed since commit f1a7941243c1 "mm: convert mm's rss stats
> > into percpu_counter")
>
> So I added a new version of the current aliasing (well, actually pcpu_hot
> in the new version) on top of Uros’s patches, and the effect can be seen
> in many functions. I don’t want to bother with many examples so here is
> a common and simple one:
>
> Currently syscall_exit_work() that starts with:
>
> 0xffffffff8111e120 <+0>: push %rbp
> 0xffffffff8111e121 <+1>: mov %rdi,%rbp
> 0xffffffff8111e124 <+4>: push %rbx
> 0xffffffff8111e125 <+5>: mov %rsi,%rbx
> 0xffffffff8111e128 <+8>: and $0x20,%esi
> 0xffffffff8111e12b <+11>: je 0xffffffff8111e143 <syscall_exit_work+35>
> 0xffffffff8111e12d <+13>: mov %gs:0x2ac80,%rax
> 0xffffffff8111e136 <+22>: cmpb $0x0,0x800(%rax)
> 0xffffffff8111e13d <+29>: jne 0xffffffff8111e22a <syscall_exit_work+266>
> 0xffffffff8111e143 <+35>: mov %gs:0x2ac80,%rax
> 0xffffffff8111e14c <+44>: cmpq $0x0,0x7c8(%rax)
>
> Using the const-alias changes the beginning of syscall_exit_work to:
>
> 0xffffffff8111cb80 <+0>: push %r12
> 0xffffffff8111cb82 <+2>: mov %gs:0x7ef0e0f6(%rip),%r12 # 0x2ac80 <pcpu_hot>
> 0xffffffff8111cb8a <+10>: push %rbp
> 0xffffffff8111cb8b <+11>: mov %rdi,%rbp
> 0xffffffff8111cb8e <+14>: push %rbx
> 0xffffffff8111cb8f <+15>: mov %rsi,%rbx
> 0xffffffff8111cb92 <+18>: and $0x20,%esi
> 0xffffffff8111cb95 <+21>: je 0xffffffff8111cba6 <syscall_exit_work+38>
> 0xffffffff8111cb97 <+23>: cmpb $0x0,0x800(%r12)
> 0xffffffff8111cba0 <+32>: jne 0xffffffff8111cc7a <syscall_exit_work+250>
> 0xffffffff8111cba6 <+38>: cmpq $0x0,0x7c8(%r12)
>
> So we both see RIP-relative addressing is being used (hence the instruction is
> one byte shorter) and the reload going away.
>
> Now, I am not a compiler expert as for the rationale, but googling around
> I can see Nick explaining the rationale [1] - if you use “p” you read memory.
> BTW: It is related to discussion you had [2], in which you encountered an issue
> I also encountered before [3]. My bad for pushing it in.
>
> Anyhow, I created a similar code on godbolt ( https://godbolt.org/z/dPqKKzPs4 )
> to show this behavior - how compiler barriers cause reload. It seems that this
> behavior happens on GCC and CLANG on various versions.
>
> The idea behind the patch is that it communicates - in the compilation unit
> granularity - that current is fixed. There is an issue of whether it works with
> LTO, which I have never checked.
>
>
> [1] https://reviews.llvm.org/D145416
> [2] https://lore.kernel.org/lkml/[email protected]/
> [3] https://lore.kernel.org/all/[email protected]/
>
> --
>
> Here’s the updated patch - but I didn’t really boot a machine with it so new
> issues might have come since my last patch-set:

Unfortunately, it does not work and dies early in the boot with:

[ 4.939358] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 4.940090] #PF: supervisor write access in kernel mode
[ 4.940090] #PF: error_code(0x0002) - not-present page
[ 4.940090] PGD 0 P4D 0
[ 4.940090] Oops: 0002 [#1] PREEMPT SMP PTI
[ 4.940090] CPU: 1 PID: 52 Comm: kworker/u4:1 Not tainted
6.6.0-rc6-00365-g0c09c1d70838-dirty #7
[ 4.940090] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.16.2-1.fc37 04/01/2014
[ 4.940090] RIP: 0010:begin_new_exec+0x8f2/0xa30
[ 4.940090] Code: 31 f6 e8 c1 49 f9 ff e9 3c fa ff ff 31 f6 4c 89
ef e8 b2 4a f9 ff e9 19 fa ff ff 31 f6 4c 89 ef e8 23 4a f9 ff e9 ea
fa ff ff <f0> 41 ff 0c 24 0f
85 55 fb ff ff 4c 89 e7 e8 4b 02 df ff e9 48 fb
[ 4.940090] RSP: 0000:ffff9c84c01e3d68 EFLAGS: 00010246
[ 4.940090] RAX: 0000000000000000 RBX: ffff9946e30c1f00 RCX: 0000000000000000
[ 4.940090] RDX: 0000000000000000 RSI: ffff9946e2ff0000 RDI: ffff9946e30c2718
[ 4.940090] RBP: ffff9946c03a7c00 R08: 00000000fffffffe R09: 00000000ffffffff
[ 4.940090] R10: 000001ffffffffff R11: 0000000000000001 R12: 0000000000000000
[ 4.940090] R13: 0000000000000000 R14: ffff9946e30c2718 R15: ffff9946e2ff0000
[ 4.940090] FS: 0000000000000000(0000) GS:ffff994763f00000(0000)
knlGS:0000000000000000
[ 4.940090] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4.940090] CR2: 0000000000000000 CR3: 00000001003a8000 CR4: 00000000000406f0
[ 4.940090] Call Trace:
[ 4.940090] <TASK>
[ 4.940090] ? __die+0x1e/0x60
[ 4.940090] ? page_fault_oops+0x17b/0x470
[ 4.940090] ? search_module_extables+0x14/0x50
[ 4.940090] ? exc_page_fault+0x66/0x140
[ 4.940090] ? asm_exc_page_fault+0x26/0x30
[ 4.940090] ? begin_new_exec+0x8f2/0xa30
[ 4.940090] ? begin_new_exec+0x3ce/0xa30
[ 4.940090] ? load_elf_phdrs+0x67/0xb0
[ 4.940090] load_elf_binary+0x2bb/0x1770
[ 4.940090] ? __kernel_read+0x136/0x2d0
[ 4.940090] bprm_execve+0x277/0x630
[ 4.940090] kernel_execve+0x145/0x1a0
[ 4.940090] call_usermodehelper_exec_async+0xcb/0x180
[ 4.940090] ? __pfx_call_usermodehelper_exec_async+0x10/0x10
[ 4.940090] ret_from_fork+0x2f/0x50
[ 4.940090] ? __pfx_call_usermodehelper_exec_async+0x10/0x10
[ 4.940090] ret_from_fork_asm+0x1b/0x30
[ 4.940090] </TASK>
[ 4.940090] Modules linked in:
[ 4.940090] CR2: 0000000000000000
[ 5.017606] ---[ end trace 0000000000000000 ]---
[ 5.018957] RIP: 0010:begin_new_exec+0x8f2/0xa30
[ 5.020299] Code: 31 f6 e8 c1 49 f9 ff e9 3c fa ff ff 31 f6 4c 89
ef e8 b2 4a f9 ff e9 19 fa ff ff 31 f6 4c 89 ef e8 23 4a f9 ff e9 ea
fa ff ff <f0> 41 ff 0c 24 0f 85 55 fb ff ff 4c 89 e7 e8 4b 02 df ff e9
48 fb
[ 5.024765] RSP: 0000:ffff9c84c01e3d68 EFLAGS: 00010246
[ 5.026150] RAX: 0000000000000000 RBX: ffff9946e30c1f00 RCX: 0000000000000000
[ 5.027916] RDX: 0000000000000000 RSI: ffff9946e2ff0000 RDI: ffff9946e30c2718
[ 5.029714] RBP: ffff9946c03a7c00 R08: 00000000fffffffe R09: 00000000ffffffff
[ 5.031461] R10: 000001ffffffffff R11: 0000000000000001 R12: 0000000000000000
[ 5.033186] R13: 0000000000000000 R14: ffff9946e30c2718 R15: ffff9946e2ff0000
[ 5.034908] FS: 0000000000000000(0000) GS:ffff994763f00000(0000)
knlGS:0000000000000000
[ 5.036907] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5.038341] CR2: 0000000000000000 CR3: 00000001003a8000 CR4: 00000000000406f0
[ 5.040044] Kernel panic - not syncing: Fatal exception
[ 5.040647] Kernel Offset: 0x22e00000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)

It looks like aliasing a structure from another namespace is a no-go,
since the patch (attached, slightly changed your patch) without
__percpu_seg_override decorations bootstraps OK. The working patch
(without __percpu_seg_override) is not effective (no effect in
syscall_exit_work) and increases the number of current_task reads from
3841 to 4711.

Uros.


Attachments:
p.diff.txt (2.31 kB)

2023-10-16 19:25:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Mon, 16 Oct 2023 at 11:53, Uros Bizjak <[email protected]> wrote:
>
> Unfortunately, it does not work and dies early in the boot with:

Side note: build the kernel with debug info (the limited form is
sufficient), and then run oopses through

./scripts/decode_stacktrace.sh

to get much nicer oops information that has line numbers and inlining
information in the backtrace.

> [ 4.939358] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 4.940090] RIP: 0010:begin_new_exec+0x8f2/0xa30
> [ 4.940090] Code: 31 f6 e8 c1 49 f9 ff e9 3c fa ff ff 31 f6 4c 89
> ef e8 b2 4a f9 ff e9 19 fa ff ff 31 f6 4c 89 ef e8 23 4a f9 ff e9 ea
> fa ff ff <f0> 41 ff 0c 24 0f
> 85 55 fb ff ff 4c 89 e7 e8 4b 02 df ff e9 48 fb

That decodes to

0: 31 f6 xor %esi,%esi
2: e8 c1 49 f9 ff call 0xfffffffffff949c8
7: e9 3c fa ff ff jmp 0xfffffffffffffa48
c: 31 f6 xor %esi,%esi
e: 4c 89 ef mov %r13,%rdi
11: e8 b2 4a f9 ff call 0xfffffffffff94ac8
16: e9 19 fa ff ff jmp 0xfffffffffffffa34
1b: 31 f6 xor %esi,%esi
1d: 4c 89 ef mov %r13,%rdi
20: e8 23 4a f9 ff call 0xfffffffffff94a48
25: e9 ea fa ff ff jmp 0xfffffffffffffb14
2a:* f0 41 ff 0c 24 lock decl (%r12) <-- trapping instruction
2f: 0f 85 55 fb ff ff jne 0xfffffffffffffb8a
35: 4c 89 e7 mov %r12,%rdi
38: e8 4b 02 df ff call 0xffffffffffdf0288

but without a nicer backtrace it's nasty to guess where this is.

The "lock decl ; jne" is a good hint, though - that sequence is most
definitely "atomic_dec_and_test()".

And that in turn means that it's almost certainly mmdrop(), which is

        if (unlikely(atomic_dec_and_test(&mm->mm_count)))
                __mmdrop(mm);

where that

35: 4c 89 e7 mov %r12,%rdi
38: e8 4b 02 df ff call 0xffffffffffdf0288

is exactly the unlikely "__mmdrop(mm)" part (and gcc decided to make
the likely branch a branch-out for some reason - presumably with the
inlining the code around it meant that was the better layout - maybe
this was all inside another "unlikely()" branch).

And if I read that right, this has all been inlined from
begin_new_exec() -> exec_mmap() -> mmdrop_lazy_tlb().

Now, how and why 'mm' would be NULL in that path, and why any
'current' reloading optimization would matter in this all I very much
can't see. The call site in begin_new_exec() is

        /*
         * Release all of the old mmap stuff
         */
        acct_arg_size(bprm, 0);
        retval = exec_mmap(bprm->mm);
        if (retval)
                goto out;

        bprm->mm = NULL;

and "bprm->mm" is most definitely non-NULL there because we earlier did

So I suspect the problem happened much earlier, caused some nasty
internal corruption, and the odd 'mm is NULL' is just a symptom.

retval = set_mm_exe_file(bprm->mm, bprm->file);

using it, and that would have oopsed had bprm->mm been NULL then.

So there's some serious corruption there, but from the oops itself I
can't tell the source. I guess if we get 'current' wrong anywhere, all
bets are off.

Linus

2023-10-16 20:35:45

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()


> On Oct 16, 2023, at 10:24 PM, Linus Torvalds <[email protected]> wrote:
>
> So there's some serious corruption there, but from the oops itself I
> can't tell the source. I guess if we get 'current' wrong anywhere, all
> bets are off.

I don’t think it means that the aliasing does not work; I think it all means that
it actually works *too well*.

I have encountered several such issues before [1], and while some have been fixed,
some have not (I looked at switch_fpu_finish()), and might under the right/wrong
circumstances use the wrongly-“cached” current. Moreover, perhaps new problems
have been added since my old patch.

Perhaps the whack-a-mole approach that I took in [1] is wrong. Instead, perhaps it
is better to use an uncached version of current when compiling
arch/x86/kernel/process_64.c and arch/x86/kernel/process_32.c , so they would not
use the const alias, but instead use the original (non-const) one. Some macro
magic and an additional "-D” build flag can do that.
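
A rough sketch of what I have in mind (flag name and exact plumbing
made up, untested):

--cut here--
/* arch/x86/kernel/Makefile (hypothetical flag):
 *      CFLAGS_process_64.o += -DARCH_MODIFIES_CURRENT
 *      CFLAGS_process_32.o += -DARCH_MODIFIES_CURRENT
 */

/* arch/x86/include/asm/current.h */
static __always_inline struct task_struct *get_current(void)
{
#ifdef ARCH_MODIFIES_CURRENT
        /* files that write pcpu_hot.current_task must not use the
           const alias, or the compiler may cache a stale value */
        return this_cpu_read_stable(pcpu_hot.current_task);
#else
        return const_pcpu_hot.current_task;
#endif
}
--cut here--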

But first, there is a need to confirm that’s actually the problem. I’ll try to do
it tomorrow.

Regards,
Nadav


[1] https://lore.kernel.org/all/[email protected]/

2023-10-16 21:00:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Mon, 16 Oct 2023 at 13:35, Nadav Amit <[email protected]> wrote:
>
> I don’t think it means that it the aliasing does not work; I think it all means that
> it actually works *too well*.

Hmm. The *only* case I can think of is __switch_to() itself when doing that

raw_cpu_write(pcpu_hot.current_task, next_p);

and arguably that part should never have been done in C at all, but here we are.

I do think that from a sanity standpoint, it would be good to split
"__switch_to()" into two: a "before stack switch" and "after stack
switch", and change 'current' from within the asm section (ie do that
part in __switch_to_asm").

Or maybe we should just keep the single "__switch_to()" function, but
make it clear that it happens *after* current has been changed (and
maybe rename it to "finish_switch_to()" or something like that to make
it clearer).

Because the situation with __switch_to() right now is outright
confusing, where it has 'prevp' and 'nextp', but then implicitly uses
'current' in two different ways.

Ugh.

Linus

2023-10-16 21:11:03

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Mon, Oct 16, 2023 at 8:52 PM Uros Bizjak <[email protected]> wrote:
>
> On Thu, Oct 12, 2023 at 5:19 PM Nadav Amit <[email protected]> wrote:
> >
> >
> > > On Oct 12, 2023, at 12:54 AM, Linus Torvalds <[email protected]> wrote:
> > >
> > > On Wed, 11 Oct 2023 at 14:33, Uros Bizjak <[email protected]> wrote:
> > >>
> > >> Reading the above, it looks to me that we don't want to play games
> > >> with "const aliased" versions of current_task [1], as proposed by
> > >> Nadav in his patch series.
> > >
> > > Well, maybe I'd like it if I saw what the effect of it was, but that
> > > patch mentions "sync_mm_rss()" which doesn't actually exist
> > > (SPLIT_RSS_COUNTING is never defined, the split version is gone and
> > > hasn't existed since commit f1a7941243c1 "mm: convert mm's rss stats
> > > into percpu_counter")
> >
> > So I added a new version of the current aliasing (well, actually pcpu_hot
> > in the new version) on top of Uros’s patches, and the effect can be seen
> > in many functions. I don’t want to bother with many examples so here is
> > a common and simple one:
> >
> > Currently syscall_exit_work() that starts with:
> >
> > 0xffffffff8111e120 <+0>: push %rbp
> > 0xffffffff8111e121 <+1>: mov %rdi,%rbp
> > 0xffffffff8111e124 <+4>: push %rbx
> > 0xffffffff8111e125 <+5>: mov %rsi,%rbx
> > 0xffffffff8111e128 <+8>: and $0x20,%esi
> > 0xffffffff8111e12b <+11>: je 0xffffffff8111e143 <syscall_exit_work+35>
> > 0xffffffff8111e12d <+13>: mov %gs:0x2ac80,%rax
> > 0xffffffff8111e136 <+22>: cmpb $0x0,0x800(%rax)
> > 0xffffffff8111e13d <+29>: jne 0xffffffff8111e22a <syscall_exit_work+266>
> > 0xffffffff8111e143 <+35>: mov %gs:0x2ac80,%rax
> > 0xffffffff8111e14c <+44>: cmpq $0x0,0x7c8(%rax)
> >
> > Using the const-alias changes the beginning of syscall_exit_work to:
> >
> > 0xffffffff8111cb80 <+0>: push %r12
> > 0xffffffff8111cb82 <+2>: mov %gs:0x7ef0e0f6(%rip),%r12 # 0x2ac80 <pcpu_hot>
> > 0xffffffff8111cb8a <+10>: push %rbp
> > 0xffffffff8111cb8b <+11>: mov %rdi,%rbp
> > 0xffffffff8111cb8e <+14>: push %rbx
> > 0xffffffff8111cb8f <+15>: mov %rsi,%rbx
> > 0xffffffff8111cb92 <+18>: and $0x20,%esi
> > 0xffffffff8111cb95 <+21>: je 0xffffffff8111cba6 <syscall_exit_work+38>
> > 0xffffffff8111cb97 <+23>: cmpb $0x0,0x800(%r12)
> > 0xffffffff8111cba0 <+32>: jne 0xffffffff8111cc7a <syscall_exit_work+250>
> > 0xffffffff8111cba6 <+38>: cmpq $0x0,0x7c8(%r12)
> >
> > So we both see RIP-relative addressing is being used (hence the instruction is
> > one byte shorter) and the reload going away.
> >
> > Now, I am not a compiler expert as for the rationale, but it googling around
> > I can see Nick explaining the rationale [1] - if you use “p” your read memory.
> > BTW: It is related to discussion you had [2], in which you encountered an issue
> > I also encountered before [3]. My bad for pushing it in.
> >
> > Anyhow, I created a similar code on godbolt ( https://godbolt.org/z/dPqKKzPs4 )
> > to show this behavior - how compiler barriers cause reload. It seems that this
> > behavior happens on GCC and CLANG on various versions.
> >
> > The idea behind the patch is that it communicates - in the compilation unit
> > granularity - that current is fixed. There is an issue of whether it works with
> > LTO, which I have never checked.
> >
> >
> > [1] https://reviews.llvm.org/D145416
> > [2] https://lore.kernel.org/lkml/[email protected]/
> > [3] https://lore.kernel.org/all/[email protected]/
> >
> > --
> >
> > Here’s the updated patch - but I didn’t really boot a machine with it so new
> > issues might have come since my last patch-set:
>
> Unfortunately, it does not work and dies early in the boot with:
>
> [ 4.939358] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 4.940090] #PF: supervisor write access in kernel mode
> [ 4.940090] #PF: error_code(0x0002) - not-present page
> [ 4.940090] PGD 0 P4D 0
> [ 4.940090] Oops: 0002 [#1] PREEMPT SMP PTI
> [ 4.940090] CPU: 1 PID: 52 Comm: kworker/u4:1 Not tainted
> 6.6.0-rc6-00365-g0c09c1d70838-dirty #7
> [ 4.940090] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.16.2-1.fc37 04/01/2014
> [ 4.940090] RIP: 0010:begin_new_exec+0x8f2/0xa30
> [ 4.940090] Code: 31 f6 e8 c1 49 f9 ff e9 3c fa ff ff 31 f6 4c 89
> ef e8 b2 4a f9 ff e9 19 fa ff ff 31 f6 4c 89 ef e8 23 4a f9 ff e9 ea
> fa ff ff <f0> 41 ff 0c 24 0f
> 85 55 fb ff ff 4c 89 e7 e8 4b 02 df ff e9 48 fb
> [ 4.940090] RSP: 0000:ffff9c84c01e3d68 EFLAGS: 00010246
> [ 4.940090] RAX: 0000000000000000 RBX: ffff9946e30c1f00 RCX: 0000000000000000
> [ 4.940090] RDX: 0000000000000000 RSI: ffff9946e2ff0000 RDI: ffff9946e30c2718
> [ 4.940090] RBP: ffff9946c03a7c00 R08: 00000000fffffffe R09: 00000000ffffffff
> [ 4.940090] R10: 000001ffffffffff R11: 0000000000000001 R12: 0000000000000000
> [ 4.940090] R13: 0000000000000000 R14: ffff9946e30c2718 R15: ffff9946e2ff0000
> [ 4.940090] FS: 0000000000000000(0000) GS:ffff994763f00000(0000)
> knlGS:0000000000000000
> [ 4.940090] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 4.940090] CR2: 0000000000000000 CR3: 00000001003a8000 CR4: 00000000000406f0
> [ 4.940090] Call Trace:
> [ 4.940090] <TASK>
> [ 4.940090] ? __die+0x1e/0x60
> [ 4.940090] ? page_fault_oops+0x17b/0x470
> [ 4.940090] ? search_module_extables+0x14/0x50
> [ 4.940090] ? exc_page_fault+0x66/0x140
> [ 4.940090] ? asm_exc_page_fault+0x26/0x30
> [ 4.940090] ? begin_new_exec+0x8f2/0xa30
> [ 4.940090] ? begin_new_exec+0x3ce/0xa30
> [ 4.940090] ? load_elf_phdrs+0x67/0xb0
> [ 4.940090] load_elf_binary+0x2bb/0x1770
> [ 4.940090] ? __kernel_read+0x136/0x2d0
> [ 4.940090] bprm_execve+0x277/0x630
> [ 4.940090] kernel_execve+0x145/0x1a0
> [ 4.940090] call_usermodehelper_exec_async+0xcb/0x180
> [ 4.940090] ? __pfx_call_usermodehelper_exec_async+0x10/0x10
> [ 4.940090] ret_from_fork+0x2f/0x50
> [ 4.940090] ? __pfx_call_usermodehelper_exec_async+0x10/0x10
> [ 4.940090] ret_from_fork_asm+0x1b/0x30
> [ 4.940090] </TASK>
> [ 4.940090] Modules linked in:
> [ 4.940090] CR2: 0000000000000000
> [ 5.017606] ---[ end trace 0000000000000000 ]---
> [ 5.018957] RIP: 0010:begin_new_exec+0x8f2/0xa30
> [ 5.020299] Code: 31 f6 e8 c1 49 f9 ff e9 3c fa ff ff 31 f6 4c 89
> ef e8 b2 4a f9 ff e9 19 fa ff ff 31 f6 4c 89 ef e8 23 4a f9 ff e9 ea
> fa ff ff <f0> 41 ff 0c 24 0f 85 55 fb ff ff 4c 89 e7 e8 4b 02 df ff e9
> 48 fb
> [ 5.024765] RSP: 0000:ffff9c84c01e3d68 EFLAGS: 00010246
> [ 5.026150] RAX: 0000000000000000 RBX: ffff9946e30c1f00 RCX: 0000000000000000
> [ 5.027916] RDX: 0000000000000000 RSI: ffff9946e2ff0000 RDI: ffff9946e30c2718
> [ 5.029714] RBP: ffff9946c03a7c00 R08: 00000000fffffffe R09: 00000000ffffffff
> [ 5.031461] R10: 000001ffffffffff R11: 0000000000000001 R12: 0000000000000000
> [ 5.033186] R13: 0000000000000000 R14: ffff9946e30c2718 R15: ffff9946e2ff0000
> [ 5.034908] FS: 0000000000000000(0000) GS:ffff994763f00000(0000)
> knlGS:0000000000000000
> [ 5.036907] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 5.038341] CR2: 0000000000000000 CR3: 00000001003a8000 CR4: 00000000000406f0
> [ 5.040044] Kernel panic - not syncing: Fatal exception
> [ 5.040647] Kernel Offset: 0x22e00000 from 0xffffffff81000000
> (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>
> It looks that aliasing a structure from another namespace is a no-go,
> since the patch (attached, slightly changed your patch) without
> __percpu_seg_override decorations bootstraps OK. The working patch
> (without __percpu_seg_override) is not effective (no effect in
> syscall_exit_work) and increases the number of current_task reads from
> 3841 to 4711.

Forgot to say that the "nonworking" patch reduces the number of
current_task reads to 3221.

+#ifdef CONFIG_USE_X86_SEG_SUPPORT
+static __always_inline struct task_struct *get_current(void)
+{
+ return this_cpu_read_const(const_pcpu_hot.current_task);
+}

FWIW, const_pcpu_hot is in __seg_gs space when decorated with
__percpu_seg_override, so plain "return const_pcpu_hot.current_task;"
would work here, too.

Uros.

2023-10-16 23:02:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Mon, 16 Oct 2023 at 13:35, Nadav Amit <[email protected]> wrote:
>
> I have encountered several such issues before [1], and while some have been fixed,
> some have not (I looked at switch_fpu_finish()), and might under the right/wrong
> circumstances use the wrongly-“cached” current. Moreover, perhaps new problems
> have been added since my old patch.

Yeah, that fpu switching is disgusting and borderline buggy. And yes,
it would trigger problems when caching the value of 'current'.

I don't particularly love the patch you pointed at, because it seems
to have only fixed the switch_fpu_finish() case, which is the one that
presumably triggered issues, but that's not a very pretty fix.

switch_fpu_prepare() has the exact same problem, and in fact is likely
the *source* of the issue, because that's the original "load current"
that then ends up being cached incorrectly later in __switch_to().

The whole

struct fpu *prev_fpu = &prev->fpu;

thing in __switch_to() is pretty ugly. There's no reason why we should
look at that 'prev_fpu' pointer there, or pass it down.

And it only generates worse code, in how it loads 'current' when
__switch_to() has the right task pointers.

So the attached patch is, I think, the right thing to do. It may not
be the *complete* fix, but at least for the config I tested, this
makes all loads of 'current' go away in the resulting generated
assembly, and the only access to '%gs:pcpu_hot(%rip)' is the write to
update it:

movq %rbx, %gs:pcpu_hot(%rip)

from that

raw_cpu_write(pcpu_hot.current_task, next_p);

code.

Thomas, I think you've touched this code last, but even that isn't
very recent. The attached patch not only cleans this up, it actually
generates better code too:

(a) it removes one push/pop pair at entry/exit because there's one
less register used (no 'current')

(b) it removes that pointless load of 'current' because it just uses
the right argument:

- #APP
- movq %gs:pcpu_hot(%rip), %r12
- #NO_APP
- testq $16384, (%r12)
+ testq $16384, (%rdi)

so I think this is the right thing to do. I checked that the 32-bit
code builds and looks sane too.

I do think the 'old/new' naming in the FPU code should probably be
'prev/next' to match the switch_to() naming, but I didn't do that.

Comments?

Linus


Attachments:
patch.diff (3.60 kB)

2023-10-16 23:15:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Mon, 16 Oct 2023 at 16:02, Linus Torvalds
<[email protected]> wrote:
>
> so I think this is the right thing to do. I checked that the 32-bit
> code builds and looks sane too.

Just to clarify: the 64-bit side I actually booted and am running.

The 32-bit side is pretty much identical, but I only checked that that
'process_32.c' file still builds. I didn't do any other testing.

Linus

2023-10-17 07:24:05

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()


> On Oct 17, 2023, at 2:02 AM, Linus Torvalds <[email protected]> wrote:
>
> - #APP
> - movq %gs:pcpu_hot(%rip), %r12
> - #NO_APP
> - testq $16384, (%r12)
> + testq $16384, (%rdi)
>
> so I think this is the right thing to do. I checked that the 32-bit
> code builds and looks sane too.
>
> I do think the 'old/new' naming in the FPU code should probably be
> 'prev/next' to match the switch_to() naming, but I didn't do that.
>
> Comments?

Yes, the FPU issue is the one that caused me to crash before. I indeed missed
the switch_fpu_prepare(). The other issue that I encountered before, with
__resctrl_sched_in() has already been taken care of.

It would have been nice to somehow prevent such a thing from reoccurring.
Presumably objtool could have ensured it is so. But anyhow, I do not know of
any other currently open issues.

This whole thing (in addition to Uros’s analysis and objdump numbers) shows
that the const-alias allows much more aggressive optimizations than the
current this_cpu_read_stable().
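
A self-contained toy example of the difference (made-up names, gcc
with __seg_gs support assumed):

--cut here--
struct task_struct;
struct pcpu_hot { struct task_struct *current_task; };

extern struct pcpu_hot pcpu_hot;
extern __seg_gs const struct pcpu_hot const_pcpu_hot;

/* this_cpu_read_stable() style: an opaque asm that the compiler can
   only CSE within a basic block and has to redo after a "memory"
   clobber, because of the "m" input */
static inline struct task_struct *cur_asm (void)
{
  struct task_struct *p;

  asm ("movq %%gs:%1, %0" : "=r" (p) : "m" (pcpu_hot.current_task));
  return p;
}

/* const-alias style: an ordinary load from a const object, which the
   compiler is free to keep in a register across barriers and calls */
static inline struct task_struct *cur_const (void)
{
  return const_pcpu_hot.current_task;
}
--cut here--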

2023-10-17 19:01:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, 17 Oct 2023 at 00:23, Nadav Amit <[email protected]> wrote:
>
> Yes, the FPU issue is the one that caused me to crash before.

Uros, can you verify whether that patch of mine resolves the issue you saw?

That patch is _technically_ an actual bug-fix, although right now our
existing 'current' caching that depends on just CSE'ing the inline asm
(and is apparently limited to only doing so within single basic
blocks) doesn't actually trigger the bug in our __switch_to() logic in
practice.

Linus

2023-10-17 19:12:23

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, Oct 17, 2023 at 9:00 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, 17 Oct 2023 at 00:23, Nadav Amit <[email protected]> wrote:
> >
> > Yes, the FPU issue is the one that caused me to crash before.
>
> Uros, can you verify whether that patch of mine resolves the issue you saw?
>
> That patch is _technically_ an actual bug-fix, although right now our
> existing 'current' caching that depends on just CSE'ing the inline asm
> (and is apparently limited to only doing so within single basic
> blocks) doesn't actually trigger the bug in our __switch_to() logic in
> practice.

Unfortunately, it doesn't fix the oops :(

I'm testing your patch, together with the attached patch with the
current tip tree (that already has all necessary percpu stuff), and
get exactly the same oops in:

[ 4.969657] cfg80211: Loading compiled-in X.509 certificates for
regulatory database
[ 4.980712] modprobe (53) used greatest stack depth: 13480 bytes left
[ 4.981048] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 4.981830] #PF: supervisor write access in kernel mode
[ 4.981830] #PF: error_code(0x0002) - not-present page
[ 4.981830] PGD 0 P4D 0
[ 4.981830] Oops: 0002 [#1] PREEMPT SMP PTI
[ 4.981830] CPU: 1 PID: 54 Comm: kworker/u4:1 Not tainted
6.6.0-rc6-00406-g84ab57184ff4-dirty #2
[ 4.981830] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.16.2-1.fc37 04/01/2014
[ 4.981830] RIP: 0010:begin_new_exec+0x8f2/0xa30
[ 4.981830] Code: 31 f6 e8 c1 49 f9 ff e9 3c fa ff ff 31 f6 4c 89
ef e8 b2 4a f9 ff e9 19 fa ff ff 31 f6 4c 89 ef e8 23 4a f9 ff e9 ea
fa ff ff <f0>
41 ff 0c 24 0f 85 55 fb ff ff 4c 89 e7 e8 4b 02 df ff e9 48 fb
[ 4.981830] RSP: 0000:ffffa505401f3d68 EFLAGS: 00010246
[ 4.981830] RAX: 0000000000000000 RBX: ffff89ed809e9f00 RCX: 0000000000000000
[ 4.981830] RDX: 0000000000000000 RSI: ffff89ed80e6c000 RDI: ffff89ed809ea718
[ 4.981830] RBP: ffff89ed8039ee00 R08: 00000000fffffffe R09: 00000000ffffffff
[ 4.981830] R10: 000001ffffffffff R11: 0000000000000001 R12: 0000000000000000
[ 4.981830] R13: 0000000000000000 R14: ffff89ed809ea718 R15: ffff89ed80e6c000
[ 4.981830] FS: 0000000000000000(0000) GS:ffff89ee24900000(0000)
knlGS:0000000000000000
[ 4.981830] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4.981830] CR2: 0000000000000000 CR3: 00000001003a0000 CR4: 00000000000406f0
[ 4.981830] Call Trace:
[ 4.981830] <TASK>
[ 4.981830] ? __die+0x1e/0x60
[ 4.981830] ? page_fault_oops+0x17b/0x470
[ 4.981830] ? search_module_extables+0x14/0x50
[ 4.981830] ? exc_page_fault+0x66/0x140
[ 4.981830] ? asm_exc_page_fault+0x26/0x30
[ 4.981830] ? begin_new_exec+0x8f2/0xa30
[ 4.981830] ? begin_new_exec+0x3ce/0xa30
[ 4.981830] ? load_elf_phdrs+0x67/0xb0
[ 4.981830] load_elf_binary+0x2bb/0x1770
[ 4.981830] ? __kernel_read+0x136/0x2d0
[ 4.981830] bprm_execve+0x277/0x630
[ 4.981830] kernel_execve+0x145/0x1a0
[ 4.981830] call_usermodehelper_exec_async+0xcb/0x180
[ 4.981830] ? __pfx_call_usermodehelper_exec_async+0x10/0x10
[ 4.981830] ret_from_fork+0x2f/0x50
[ 4.981830] ? __pfx_call_usermodehelper_exec_async+0x10/0x10
[ 4.981830] ret_from_fork_asm+0x1b/0x30
[ 4.981830] </TASK>
[ 4.981830] Modules linked in:
[ 4.981830] CR2: 0000000000000000
[ 5.052612] ---[ end trace 0000000000000000 ]---
[ 5.053833] RIP: 0010:begin_new_exec+0x8f2/0xa30
[ 5.055065] Code: 31 f6 e8 c1 49 f9 ff e9 3c fa ff ff 31 f6 4c 89
ef e8 b2 4a f9 ff e9 19 fa ff ff 31 f6 4c 89 ef e8 23 4a f9 ff e9 ea
fa ff ff <f0>
41 ff 0c 24 0f 85 55 fb ff ff 4c 89 e7 e8 4b 02 df ff e9 48 fb
[ 5.059476] RSP: 0000:ffffa505401f3d68 EFLAGS: 00010246
[ 5.060780] RAX: 0000000000000000 RBX: ffff89ed809e9f00 RCX: 0000000000000000
[ 5.062483] RDX: 0000000000000000 RSI: ffff89ed80e6c000 RDI: ffff89ed809ea718
[ 5.064190] RBP: ffff89ed8039ee00 R08: 00000000fffffffe R09: 00000000ffffffff
[ 5.065908] R10: 000001ffffffffff R11: 0000000000000001 R12: 0000000000000000
[ 5.067625] R13: 0000000000000000 R14: ffff89ed809ea718 R15: ffff89ed80e6c000
[ 5.069343] FS: 0000000000000000(0000) GS:ffff89ee24900000(0000)
knlGS:0000000000000000
[ 5.071313] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5.072732] CR2: 0000000000000000 CR3: 00000001003a0000 CR4: 00000000000406f0
[ 5.074439] Kernel panic - not syncing: Fatal exception
[ 5.075028] Kernel Offset: 0xcc00000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)



Uros.


Attachments:
current.diff.txt (1.83 kB)

2023-10-17 21:06:40

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, Oct 17, 2023 at 9:11 PM Uros Bizjak <[email protected]> wrote:
>
> On Tue, Oct 17, 2023 at 9:00 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Tue, 17 Oct 2023 at 00:23, Nadav Amit <[email protected]> wrote:
> > >
> > > Yes, the FPU issue is the one that caused me to crash before.
> >
> > Uros, can you verify whether that patch of mine resolves the issue you saw?
> >
> > That patch is _technically_ an actual bug-fix, although right now our
> > existing 'current' caching that depends on just CSE'ing the inline asm
> > (and is apparently limited to only doing so within single basic
> > blocks) doesn't actually trigger the bug in our __switch_to() logic in
> > practice.
>
> Unfortunately, it doesn't fix the oops :(
>
> I'm testing your patch, together with the attached patch with the
> current tip tree (that already has all necessary percpu stuff), and
> get exactly the same oops in:

But adding the attached patch on top of both patches boots OK.

Uros.


Attachments:
p.diff.txt (302.00 B)

2023-10-17 21:54:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, 17 Oct 2023 at 14:06, Uros Bizjak <[email protected]> wrote:
>
> But adding the attached patch on top of both patches boots OK.

Funky.

Mind adding a

WARN_ON_ONCE(!active_mm);

to there to give a nice backtrace for the odd NULL case.

That code *is* related to 'current', in how we do

tsk = current;
...
local_irq_disable();
active_mm = tsk->active_mm;
tsk->active_mm = mm;
tsk->mm = mm;
...
activate_mm(active_mm, mm);
...
mmdrop_lazy_tlb(active_mm);

but I don't see how 'active_mm' could *possibly* be validly NULL
here, and why caching 'current' would matter and change it.

Strange.

Hmm. We do set

tsk->active_mm = NULL;

in copy_mm(), and then we have that odd kernel thread case:

        /*
         * Are we cloning a kernel thread?
         *
         * We need to steal a active VM for that..
         */
        oldmm = current->mm;
        if (!oldmm)
                return 0;

but none of this should even matter, because by the time we actually
*schedule* that thread, we'll set active_mm to the right thing.

Can anybody see what's up?

Linus

2023-10-17 22:07:08

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()



> On Oct 18, 2023, at 12:53 AM, Linus Torvalds <[email protected]> wrote:
>
>
> but none of this should even matter, because by the time we actually
> *schedule* that thread, we'll set active_mm to the right thing.
>
> Can anybody see what's up?

Could it be related to exec_mmap() -> exec_mm_release() -> mm_release() -> deactivate_mm() ?

#define deactivate_mm(tsk, mm)                  \
do {                                            \
        if (!tsk->vfork_done)                   \
                shstk_free(tsk);                \
        load_gs_index(0);                       \
        loadsegment(fs, 0);                     \
} while (0)

We change gs_index(), so perhaps it affects later GS reads. There is also this
X86_BUG_NULL_SEG. Need to dive deeper; just initial thoughts though (i.e., I might be
completely off).

2023-10-17 22:29:42

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()



> On Oct 18, 2023, at 1:06 AM, Nadav Amit <[email protected]> wrote:
>
>
>
>> On Oct 18, 2023, at 12:53 AM, Linus Torvalds <[email protected]> wrote:
>>
>>
>> but none of this should even matter, because by the time we actually
>> *schedule* that thread, we'll set active_mm to the right thing.
>>
>> Can anybody see what's up?
>
> Could it be related to exec_mmap() -> exec_mm_release() -> mm_release() -> deactivate_mm() ?

I am probably completely wrong. Sorry for the noise.

2023-10-18 07:46:54

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Tue, Oct 17, 2023 at 11:53 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, 17 Oct 2023 at 14:06, Uros Bizjak <[email protected]> wrote:
> >
> > But adding the attached patch on top of both patches boots OK.
>
> Funky.
>
> Mind adding a
>
> WARN_ON_ONCE(!active_mm);
>
> to there to give a nice backtrace for the odd NULL case.

[ 4.907840] Call Trace:
[ 4.908909] <TASK>
[ 4.909858] ? __warn+0x7b/0x120
[ 4.911108] ? begin_new_exec+0x90f/0xa30
[ 4.912602] ? report_bug+0x164/0x190
[ 4.913929] ? handle_bug+0x3c/0x70
[ 4.915179] ? exc_invalid_op+0x17/0x70
[ 4.916569] ? asm_exc_invalid_op+0x1a/0x20
[ 4.917969] ? begin_new_exec+0x90f/0xa30
[ 4.919303] ? begin_new_exec+0x3ce/0xa30
[ 4.920667] ? load_elf_phdrs+0x67/0xb0
[ 4.921935] load_elf_binary+0x2bb/0x1770
[ 4.923262] ? __kernel_read+0x136/0x2d0
[ 4.924563] bprm_execve+0x277/0x630
[ 4.925703] kernel_execve+0x145/0x1a0
[ 4.926890] call_usermodehelper_exec_async+0xcb/0x180
[ 4.928408] ? __pfx_call_usermodehelper_exec_async+0x10/0x10
[ 4.930515] ret_from_fork+0x2f/0x50
[ 4.931894] ? __pfx_call_usermodehelper_exec_async+0x10/0x10
[ 4.933941] ret_from_fork_asm+0x1b/0x30
[ 4.935371] </TASK>
[ 4.936212] ---[ end trace 0000000000000000 ]---

>
> That code *is* related to 'current', in how we do
>
> tsk = current;
> ...
> local_irq_disable();
> active_mm = tsk->active_mm;
> tsk->active_mm = mm;
> tsk->mm = mm;
> ...
> activate_mm(active_mm, mm);
> ...
> mmdrop_lazy_tlb(active_mm);
>
> but I don't see how 'active_mm' could *poossibly* be validly NULL
> here, and why caching 'current' would matter and change it.

I have also added "__attribute__((optimize(0)))" to exec_mmap() to
weed out compiler bugs. The result was the same oops in
mmdrop_lazy_tlb.

Also, when using WARN_ON instead of WARN_ON_ONCE, it triggers only
once during the whole boot, with the above trace.

Another observation: adding WARN_ON to the top of exec_mmap:

WARN_ON(!current->active_mm);
/* Notify parent that we're no longer interested in the old VM */
tsk = current;
old_mm = current->mm;

also triggers WARN, suggesting that current does not have active_mm
set on the entry to the function.

Uros.

> Strange.
>
> Hmm. We do set
>
> tsk->active_mm = NULL;
>
> in copy_mm(), and then we have that odd kernel thread case:
>
> /*
> * Are we cloning a kernel thread?
> *
> * We need to steal a active VM for that..
> */
> oldmm = current->mm;
> if (!oldmm)
> return 0;
>
> but none of this should even matter, because by the time we actually
> *schedule* that thread, we'll set active_mm to the right thing.
>
> Can anybody see what's up?
>
> Linus

2023-10-18 09:05:25

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 9:46 AM Uros Bizjak <[email protected]> wrote:
>
> On Tue, Oct 17, 2023 at 11:53 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Tue, 17 Oct 2023 at 14:06, Uros Bizjak <[email protected]> wrote:
> > >
> > > But adding the attached patch on top of both patches boots OK.
> >
> > Funky.
> >
> > Mind adding a
> >
> > WARN_ON_ONCE(!active_mm);
> >
> > to there to give a nice backtrace for the odd NULL case.
>
> [ 4.907840] Call Trace:
> [ 4.908909] <TASK>
> [ 4.909858] ? __warn+0x7b/0x120
> [ 4.911108] ? begin_new_exec+0x90f/0xa30
> [ 4.912602] ? report_bug+0x164/0x190
> [ 4.913929] ? handle_bug+0x3c/0x70
> [ 4.915179] ? exc_invalid_op+0x17/0x70
> [ 4.916569] ? asm_exc_invalid_op+0x1a/0x20
> [ 4.917969] ? begin_new_exec+0x90f/0xa30
> [ 4.919303] ? begin_new_exec+0x3ce/0xa30
> [ 4.920667] ? load_elf_phdrs+0x67/0xb0
> [ 4.921935] load_elf_binary+0x2bb/0x1770
> [ 4.923262] ? __kernel_read+0x136/0x2d0
> [ 4.924563] bprm_execve+0x277/0x630
> [ 4.925703] kernel_execve+0x145/0x1a0
> [ 4.926890] call_usermodehelper_exec_async+0xcb/0x180
> [ 4.928408] ? __pfx_call_usermodehelper_exec_async+0x10/0x10
> [ 4.930515] ret_from_fork+0x2f/0x50
> [ 4.931894] ? __pfx_call_usermodehelper_exec_async+0x10/0x10
> [ 4.933941] ret_from_fork_asm+0x1b/0x30
> [ 4.935371] </TASK>
> [ 4.936212] ---[ end trace 0000000000000000 ]---
>
> >
> > That code *is* related to 'current', in how we do
> >
> > tsk = current;
> > ...
> > local_irq_disable();
> > active_mm = tsk->active_mm;
> > tsk->active_mm = mm;
> > tsk->mm = mm;
> > ...
> > activate_mm(active_mm, mm);
> > ...
> > mmdrop_lazy_tlb(active_mm);
> >
> > but I don't see how 'active_mm' could *poossibly* be validly NULL
> > here, and why caching 'current' would matter and change it.
>
> I have also added "__attribute__((optimize(0)))" to exec_mmap() to
> weed out compiler bugs. The result was the same oops in
> mmdrop_lazy_tlb.
>
> Also, when using WARN_ON instead of WARN_ON_ONCE, it triggers only
> once during the whole boot, with the above trace.
>
> Another observation: adding WARN_ON to the top of exec_mmap:
>
> WARN_ON(!current->active_mm);
> /* Notify parent that we're no longer interested in the old VM */
> tsk = current;
> old_mm = current->mm;
>
> also triggers WARN, suggesting that current does not have active_mm
> set on the entry to the function.

Solved.

All that is needed is to patch cpu_init() from
arch/x86/kernel/cpu/common.c with:

--cut here--
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index b14fc8c1c953..61b6fcdf6937 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -2228,7 +2232,7 @@ void cpu_init_exception_handling(void)
*/
void cpu_init(void)
{
- struct task_struct *cur = current;
+ struct task_struct *cur = this_cpu_read_stable(pcpu_hot.current_task);
int cpu = raw_smp_processor_id();

#ifdef CONFIG_NUMA
--cut here--

This is effectively the old get_current(). Since we declare and export

+DECLARE_PER_CPU_ALIGNED(const struct pcpu_hot __percpu_seg_override,
+ const_pcpu_hot) __attribute__((alias("pcpu_hot")));
+EXPORT_PER_CPU_SYMBOL(const_pcpu_hot);

in the same file, and the "new" current represents just

return const_pcpu_hot.current_task;

GCC assumes and over-optimizes something and seemingly doesn't fully
initialize the

cur->active_mm = &init_mm;

below.

Have to run now, but this will be easy to fix.

Uros.

2023-10-18 10:54:44

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()



> On Oct 18, 2023, at 12:04 PM, Uros Bizjak <[email protected]> wrote:
>
> Solved.
>
> All that is needed is to patch cpu_init() from
> arch/x86/kernel/cpu/common.c with:
>
> --cut here--
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index b14fc8c1c953..61b6fcdf6937 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -2228,7 +2232,7 @@ void cpu_init_exception_handling(void)
> */
> void cpu_init(void)
> {
> - struct task_struct *cur = current;
> + struct task_struct *cur = this_cpu_read_stable(pcpu_hot.current_task);
> int cpu = raw_smp_processor_id();

Thanks for solving that, and sorry that I missed it.

The reason I didn’t encounter it before is that in my original patch I created
a new compilation unit which only defined the alias.

Since there might be additional problems (any “current” use in common.c is
dangerous, even in included files), I think that while there may be additional
solutions, defining the alias in a separate compilation unit - as I did before -
is the safest.

2023-10-18 12:15:48

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 12:54 PM Nadav Amit <[email protected]> wrote:
>
>
>
> > On Oct 18, 2023, at 12:04 PM, Uros Bizjak <[email protected]> wrote:
> >
> > Solved.
> >
> > All that is needed is to patch cpu_init() from
> > arch/x86/kernel/cpu/common.c with:
> >
> > --cut here--
> > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > index b14fc8c1c953..61b6fcdf6937 100644
> > --- a/arch/x86/kernel/cpu/common.c
> > +++ b/arch/x86/kernel/cpu/common.c
> > @@ -2228,7 +2232,7 @@ void cpu_init_exception_handling(void)
> > */
> > void cpu_init(void)
> > {
> > - struct task_struct *cur = current;
> > + struct task_struct *cur = this_cpu_read_stable(pcpu_hot.current_task);
> > int cpu = raw_smp_processor_id();
>
> Thanks for solving that, and sorry that I missed it.
>
> The reason I didn’t encounter it before is that in my original patch I created
> a new compilation unit which only defined the alias.
>
> Since there might be additional problems (any “current” use in common.c is
> dangerous, even in included files), I think that while there may be additional
> solutions, defining the alias in a separate compilation unit - as I did before -
> is the safest.

What happens here can be illustrated with the following testcase:

--cut here--
int init_mm;

struct task_struct
{
int *active_mm;
};

struct task_struct init_task;

struct pcpu_hot
{
struct task_struct *current_task;
};

struct pcpu_hot pcpu_hot = { .current_task = &init_task };

extern const struct pcpu_hot __seg_gs const_pcpu_hot
__attribute__((alias("pcpu_hot")));

void foo (void)
{
struct task_struct *cur = const_pcpu_hot.current_task;

cur->active_mm = &init_mm;
}
--cut here--

gcc -O2 -S:

foo:
movq $init_mm, init_task(%rip)
ret

Here, gcc optimizes the access into the generic address space, which it
is allowed to do, since *we set the alias to pcpu_hot*, which is in the
generic address space. The compiler doesn't care that we actually
want:

foo:
movq %gs:const_pcpu_hot(%rip), %rax
movq $init_mm, (%rax)

So yes, to prevent the optimization, we have to hide the alias in another TU.

BTW: Clang creates:

foo:
movq %gs:pcpu_hot(%rip), %rax
movq $init_mm, (%rax)
retq

It is a bit more conservative and retains the address space of the
aliasing symbol.

Looks like another case of underspecified functionality where both
compilers differ. Luckily, both DTRT when aliases are hidden in
another TU.

Uros.

2023-10-18 13:16:20

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 2:14 PM Uros Bizjak <[email protected]> wrote:
>
> On Wed, Oct 18, 2023 at 12:54 PM Nadav Amit <[email protected]> wrote:
> >
> >
> >
> > > On Oct 18, 2023, at 12:04 PM, Uros Bizjak <[email protected]> wrote:
> > >
> > > Solved.
> > >
> > > All that is needed is to patch cpu_init() from
> > > arch/x86/kernel/cpu/common.c with:
> > >
> > > --cut here--
> > > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > > index b14fc8c1c953..61b6fcdf6937 100644
> > > --- a/arch/x86/kernel/cpu/common.c
> > > +++ b/arch/x86/kernel/cpu/common.c
> > > @@ -2228,7 +2232,7 @@ void cpu_init_exception_handling(void)
> > > */
> > > void cpu_init(void)
> > > {
> > > - struct task_struct *cur = current;
> > > + struct task_struct *cur = this_cpu_read_stable(pcpu_hot.current_task);
> > > int cpu = raw_smp_processor_id();
> >
> > Thanks for solving that, and sorry that I missed it.
> >
> > The reason I didn’t encounter it before is that in my original patch I created
> > a new compilation unit which only defined the alias.
> >
> > Since there might be additional problems (any “current” use in common.c is
> > dangerous, even in included files), I think that while there may be additional
> > solutions, defining the alias in a separate compilation unit - as I did before -
> > is the safest.
>
> What happens here can be illustrated with the following testcase:
>
> --cut here--
> int init_mm;
>
> struct task_struct
> {
> int *active_mm;
> };
>
> struct task_struct init_task;
>
> struct pcpu_hot
> {
> struct task_struct *current_task;
> };
>
> struct pcpu_hot pcpu_hot = { .current_task = &init_task };
>
> extern const struct pcpu_hot __seg_gs const_pcpu_hot
> __attribute__((alias("pcpu_hot")));
>
> void foo (void)
> {
> struct task_struct *cur = const_pcpu_hot.current_task;
>
> cur->active_mm = &init_mm;
> }
> --cut here--
>
> gcc -O2 -S:
>
> foo:
> movq $init_mm, init_task(%rip)
> ret
>
> Here, gcc optimizes the access into the generic address space, which it
> is allowed to do, since *we set the alias to pcpu_hot*, which is in the
> generic address space. The compiler doesn't care that we actually
> want:
>
> foo:
> movq %gs:const_pcpu_hot(%rip), %rax
> movq $init_mm, (%rax)
>
> So yes, to prevent the optimization, we have to hide the alias in another TU.
>
> BTW: Clang creates:
>
> foo:
> movq %gs:pcpu_hot(%rip), %rax
> movq $init_mm, (%rax)
> retq
>
> It is a bit more conservative and retains the address space of the
> aliasing symbol.
>
> Looks like another case of underspecified functionality where both
> compilers differ. Luckily, both DTRT when aliases are hidden in
> another TU.

Attached is the prototype patch that works for me (together with
Linus' FPU switching patch).

Uros.


Attachments:
current.diff.txt (2.90 kB)

2023-10-18 14:46:41

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()


> On Oct 18, 2023, at 4:15 PM, Uros Bizjak <[email protected]> wrote:
>
>>
>> Looks like another case of underspecified functionality where both
>> compilers differ. Luckily, both DTRT when aliases are hidden in
>> another TU.
>
> Attached is the prototype patch that works for me (together with
> Linus' FPU switching patch).

In general looks good. See some minor issues below.

> --- a/arch/x86/include/asm/current.h
> +++ b/arch/x86/include/asm/current.h
> @@ -36,10 +36,23 @@ static_assert(sizeof(struct pcpu_hot) == 64);
>
> DECLARE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot);
>
> +/*
> + *
> + */

Obviously some further comments are needed to clarify why struct
pcpu_hot is defined in percpu-hot.c (the GCC manual says: "It is an
error if the alias target is not defined in the same translation unit
as the alias", which can be used as part of the explanation.)

> +DECLARE_PER_CPU_ALIGNED(const struct pcpu_hot __percpu_seg_override,
> + const_pcpu_hot);
> +
> +#ifdef CONFIG_USE_X86_SEG_SUPPORT
> +static __always_inline struct task_struct *get_current(void)
> +{
> + return const_pcpu_hot.current_task;
> +}
> +#else
> static __always_inline struct task_struct *get_current(void)
> {
> return this_cpu_read_stable(pcpu_hot.current_task);
> }
> +#endif


Please consider using IS_ENABLED() to avoid the ifdef’ry.

So this would turn to be:

static __always_inline struct task_struct *get_current(void)
{
if (IS_ENABLED(CONFIG_USE_X86_SEG_SUPPORT))
return const_pcpu_hot.current_task;

return this_cpu_read_stable(pcpu_hot.current_task);
}


2023-10-18 15:18:21

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 4:46 PM Nadav Amit <[email protected]> wrote:

> >> Looks like another case of underspecified functionality where both
> >> compilers differ. Luckily, both DTRT when aliases are hidden in
> >> another TU.
> >
> > Attached is the prototype patch that works for me (together with
> > Linus' FPU switching patch).
>
> In general looks good. See some minor issues below.
>
> > --- a/arch/x86/include/asm/current.h
> > +++ b/arch/x86/include/asm/current.h
> > @@ -36,10 +36,23 @@ static_assert(sizeof(struct pcpu_hot) == 64);
> >
> > DECLARE_PER_CPU_ALIGNED(struct pcpu_hot, pcpu_hot);
> >
> > +/*
> > + *
> > + */
>
> Obviously some further comments are needed to clarify why struct
> pcpu_hot is defined in percpu-hot.c (the GCC manual says: "It is an
> error if the alias target is not defined in the same translation unit
> as the alias", which can be used as part of the explanation.)

Sure.

>
> > +DECLARE_PER_CPU_ALIGNED(const struct pcpu_hot __percpu_seg_override,
> > + const_pcpu_hot);
> > +
> > +#ifdef CONFIG_USE_X86_SEG_SUPPORT
> > +static __always_inline struct task_struct *get_current(void)
> > +{
> > + return const_pcpu_hot.current_task;
> > +}
> > +#else
> > static __always_inline struct task_struct *get_current(void)
> > {
> > return this_cpu_read_stable(pcpu_hot.current_task);
> > }
> > +#endif
>
>
> Please consider using IS_ENABLED() to avoid the ifdef’ry.
>
> So this would turn to be:
>
> static __always_inline struct task_struct *get_current(void)
> {
> if (IS_ENABLED(CONFIG_USE_X86_SEG_SUPPORT))
> return const_pcpu_hot.current_task;
>
> return this_cpu_read_stable(pcpu_hot.current_task);
> }

I am more thinking of moving the ifdeffery to percpu.h, something like
the attached part of the patch. This would handle all current and
future stable percpu variables.

Thanks,
Uros.


Attachments:
p.diff.txt (1.13 kB)

2023-10-18 16:03:26

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()


> On Oct 18, 2023, at 6:17 PM, Uros Bizjak <[email protected]> wrote:
>
> I am more thinking of moving the ifdeffery to percpu.h, something like
> the attached part of the patch. This would handle all current and
> future stable percpu variables.

I think that for consistency this_cpu_read_stable() should always be an
rvalue, so instead of:

> #define this_cpu_read_stable(pcp) const_##pcp

You would use a statement expression:

#define this_cpu_read_stable(pcp) ({ const_##pcp; })

This would match the other (existing/fallback) definition of
this_cpu_read_stable.

Having said that, I am not sure what other usages you have in mind.
“current” is a pretty obvious straight forward case with considerable
impact on code generation. There may be additional variables, but it is
likely that there would be more functions/TU in which they would not be
constant and would require more refined techniques to avoid mistakes
such as the use of stale cached values.

2023-10-18 16:13:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 18 Oct 2023 at 06:15, Uros Bizjak <[email protected]> wrote:
>
> Attached is the prototype patch that works for me (together with
> Linus' FPU switching patch).

That looks reasonable, but I think the separate compilation unit is
unnecessary, and I still absolutely hate how that const_pcpu_hot thing
is declared twice (your patch does it in both current.h and in
percpu-hot.c).

How about we just do the whole alias as a linker thing instead? So the
same way that we just do

jiffies = jiffies_64;

in our arch/x86/kernel/vmlinux.lds.S file, we could just do

const_pcpu_hot = pcpu_hot;

in there.

Then, as far as the compiler is concerned, we just have

DECLARE_PER_CPU_ALIGNED(
const struct pcpu_hot __percpu_seg_override,
const_pcpu_hot)

and the compiler doesn't know that it's aliased to anything else.
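
So, roughly - just a sketch restating those pieces (this is the
CONFIG_USE_X86_SEG_SUPPORT side only):

/* arch/x86/kernel/vmlinux.lds.S */
const_pcpu_hot = pcpu_hot;

/* arch/x86/include/asm/current.h - the one and only declaration */
DECLARE_PER_CPU_ALIGNED(const struct pcpu_hot __percpu_seg_override,
			const_pcpu_hot);

static __always_inline struct task_struct *get_current(void)
{
	return const_pcpu_hot.current_task;
}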

And please do that declaration in just *one* place.

Linus

2023-10-18 16:27:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 18 Oct 2023 at 09:03, Nadav Amit <[email protected]> wrote:
>
> Having said that, I am not sure what other usages you have in mind.
> “current” is a pretty obvious straight forward case with considerable
> impact on code generation. There may be additional variables, but it is
> likely that there would be more functions/TU in which they would not be
> constant and would require more refined techniques to avoid mistakes
> such as the use of stale cached values.

Yeah, I don't think there really are other cases.

We do have things that could be considered stable (like
"smp_processor_id()" which is stable as long as preemption or
migration is disabled (or it's in an irq-off section)).

And it might be lovely to optimize those too, *BUT* that would require
that there be a barrier against that optimization that works.

And if there is anything that this thread has made clear, it's that
the whole 'load from a constant section' doesn't seem to have any sane
barriers.

So while the CSE for inline asm statements is a bit too weak with that
whole "only CSE within a basic block" thing, the CSE of "load a
constant value from memory" is too *strong*, in that we don't seem to
have _any_ sane way to say "now you need to reload".

The traditional way we've done that is with our "barrier()" macro,
which does the whole inline asm with a memory clobber, but even that
doesn't act as a barrier for gcc optimizing the constant load.

Which means that while we'd probably love for the compiler to optimize
smp_processor_id() a bit more, we can't use the 'stable memory
location' trick for it.

Because I can't think of anything but 'current' that would be _that_
stable as far as C code is concerned.

Linus

2023-10-18 17:24:31

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 6:13 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 18 Oct 2023 at 06:15, Uros Bizjak <[email protected]> wrote:
> >
> > Attached is the prototype patch that works for me (together with
> > Linus' FPU switching patch).
>
> That looks reasonable, but I think the separate compilation unit is
> unnecessary, and I still absolutely hate how that const_pcpu_hot thing
> is declared twice (your patch does it in both current.h and in
> percpu-hot.c).
>
> How about we just do the whole alias as a linker thing instead? So the
> same way that we just do
>
> jiffies = jiffies_64;
>
> in our arch/x86/kernel/vmlinux.lds.S file, we could just do
>
> const_pcpu_hot = pcpu_hot;
>
> in there.
>
> Then, as far as the compiler is concerned, we just have
>
> DECLARE_PER_CPU_ALIGNED(
> const struct pcpu_hot __percpu_seg_override,
> const_pcpu_hot)
>
> and the compiler doesn't know that it's aliased to anything else.

Oh...

... this works, too! Please see the attached patch.

(Note to self: Leave linking stuff to linker) ;)

> And please do that declaration in just *one* place.

Sure. Now the patch looks quite slim, but works as expected, reducing
the number of current_task accesses from 3841 to 3220.

Thanks,
Uros.


Attachments:
current-2.diff.txt (1.47 kB)

2023-10-18 17:29:19

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 6:26 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 18 Oct 2023 at 09:03, Nadav Amit <[email protected]> wrote:
> >
> > Having said that, I am not sure what other usages you have in mind.
> > “current” is a pretty obvious straight forward case with considerable
> > impact on code generation. There may be additional variables, but it is
> > likely that there would be more functions/TU in which they would not be
> > constant and would require more refined techniques to avoid mistakes
> > such as the use of stale cached values.
>
> Yeah, I don't think there really are other cases.

In processor.h, we have:

static __always_inline unsigned long current_top_of_stack(void)
{
/*
* We can't read directly from tss.sp0: sp0 on x86_32 is special in
* and around vm86 mode and sp0 on x86_64 is special because of the
* entry trampoline.
*/
return this_cpu_read_stable(pcpu_hot.top_of_stack);
}

But I don't know how much it is used.

Uros.

> We do have things that could be considered stable (like
> "smp_processor_id()" which is stable as long as preemption or
> migration is disabled (or it's in an irq-off section)).
>
> And it might be lovely to optimize those too, *BUT* that would require
> that there be a barrier against that optimization that works.
>
> And if there is anything that this thread has made clear, it's that
> the whole 'load from a constant section' doesn't seem to have any sane
> barriers.
>
> So while the CSE for inline asm statements is a bit too weak with that
> whole "only CSE within a basic block" thing, the CSE of "load a
> constant value from memory" is too *strong*, in that we don't seem to
> have _any_ sane way to say "now you need to reload".
>
> The traditional way we've done that is with our "barrier()" macro,
> which does the whole inline asm with a memory clobber, but even that
> doesn't act as a barrier for gcc optimizing the constant load.
>
> Which means that while we'd probably love for the compiler to optimize
> smp_processor_id() a bit more, we can't use the 'stable memory
> location' trick for it.
>
> Because I can't think of anything but 'current' that would be _that_
> stable as far as C code is concerned.
>
> Linus

2023-10-18 18:02:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 18 Oct 2023 at 10:08, Uros Bizjak <[email protected]> wrote:
>
> Sure. Now the patch looks quite slim, but works as expected, reducing
> the number of current_task accesses from 3841 to 3220.

Thanks, that patch looks lovely to me.

Since you've done all the hard lifting and the testing, I'd suggest
you submit this all to the x86, including my fpu patch. Take my
sign-off, and the commit message might be something along the lines of

x86: clean up fpu switching in the middle of task switching

It happens to work, but it's very very wrong, because our 'current'
macro is magic that is supposedly loading a stable value.

It just happens to be not quite stable enough and the compilers
re-load the value enough for this code to work. But it's wrong.

It also generates worse code.

So fix it.

Signed-off-by: Linus Torvalds <[email protected]>

or add any verbiage you feel appropriate.

Linus

2023-10-18 18:09:00

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 6:26 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 18 Oct 2023 at 09:03, Nadav Amit <[email protected]> wrote:
> >
> > Having said that, I am not sure what other usages you have in mind.
> > “current” is a pretty obvious straight forward case with considerable
> > impact on code generation. There may be additional variables, but it is
> > likely that there would be more functions/TU in which they would not be
> > constant and would require more refined techniques to avoid mistakes
> > such as the use of stale cached values.
>
> Yeah, I don't think there really are other cases.
>
> We do have things that could be considered stable (like
> "smp_processor_id()" which is stable as long as preemption or
> migration is disabled (or it's in an irq-off section)).
>
> And it might be lovely to optimize those too, *BUT* that would require
> that there be a barrier against that optimization that works.

But loads from non-const memory work like the above.

Please consider:

--cut here--
extern __seg_gs int m;

int foo (void)
{
int r;

r = m;
r += m;
asm volatile ("" ::: "memory");
r += m;

return r;
}

int bar (void)
{
int r;

r = m;
r += m;
r += m;

return r;
}
--cut here--

gcc -O2:

foo:
movl %gs:m(%rip), %eax
addl %eax, %eax
addl %gs:m(%rip), %eax
ret

bar:
movl %gs:m(%rip), %eax
leal (%rax,%rax,2), %eax
ret

Please note the __barrier(), implemented with asm volatile.

> And if there is anything that this thread has made clear, it's that
> the whole 'load from a constant section' doesn't seem to have any sane
> barriers.
>
> So while the CSE for inline asm statements is a bit too weak with that
> whole "only CSE within a basic block" thing, the CSE of "load a
> constant value from memory" is too *strong*, in that we don't seem to
> have _any_ sane way to say "now you need to reload".

We can use alias to __seg_gs non-const memory, so the value can be
accessed without asm. __barrier() will then force reload. Please note
that any memory clobber, hidden inside asm will also force reload.

> The traditional way we've done that is with our "barrier()" macro,
> which does the whole inline asm with a memory clobber, but even that
> doesn't act as a barrier for gcc optimizing the constant load.
>
> Which means that while we'd probably love for the compiler to optimize
> smp_processor_id() a bit more, we can't use the 'stable memory
> location' trick for it.

We should get rid of asm statements, and as shown above, __barrier()
will do the trick.

> Because I can't think of anything but 'current' that would be _that_
> stable as far as C code is concerned.

Uros.

2023-10-18 18:12:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 18 Oct 2023 at 10:23, Uros Bizjak <[email protected]> wrote:
>
> In processor.h, we have:
>
> static __always_inline unsigned long current_top_of_stack(void)

Yeah, but that is never used multiple times afaik. I think it's purely
for things like

WARN_ON_ONCE(!on_thread_stack());

in the entry code, for example.

So I guess it can use the same infrastructure, but I doubt it matters
in any practical way.

Grepping around for it, it looks like the 32-bit code has some stale commentary:

* Reload esp0 and pcpu_hot.top_of_stack. This changes
* current_thread_info().

but that seems entirely bogus. We historically picked up
current_thread_info() from %esp, but that hasn't been true in ages,
afaik. Now it's all based on 'current'.

Linus

2023-10-18 18:16:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 18 Oct 2023 at 11:08, Uros Bizjak <[email protected]> wrote:
>
> But loads from non-const memory work like the above.

Yes, I'm certainly ok with the move to use plain loads from __seg_gs
for the percpu accesses. If they didn't honor the memory clobber, we
could never use it at all.

I was just saying that the 'const' alias trick isn't useful for
anything else than 'current', because everything else needs to at
least honor our existing barriers.

(And yes, there's the other user of this_cpu_read_stable() -
'top_of_stack', but as mentioned that doesn't really matter).

Linus

2023-10-18 18:27:17

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 8:16 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 18 Oct 2023 at 11:08, Uros Bizjak <[email protected]> wrote:
> >
> > But loads from non-const memory work like the above.
>
> Yes, I'm certainly ok with the move to use plain loads from __seg_gs
> for the percpu accesses. If they didn't honor the memory clobber, we
> could never use it at all.
>
> I was just saying that the 'const' alias trick isn't useful for
> anything else than 'current', because everything else needs to at
> least honor our existing barriers.

FYI, smp_processor_id() is implemented as:

#define __smp_processor_id() __this_cpu_read(pcpu_hot.cpu_number)

where __this_* forces volatile access which disables CSE.

*If* the variable is really stable, then it should use __raw_cpu_read.
Both, __raw_* and __this_* were recently (tip/percpu branch)
implemented for SEG_SUPPORT as:

#define __raw_cpu_read(qual, pcp) \
({ \
*(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)); \
})

where "qual" can be volatile. To enable smp_processor_id()
optimization, it just needs to be moved from __this to __raw accessor.

> (And yes, there's the other user of this_cpu_read_stable() -
> 'top_of_stack', but as mentioned that doesn't really matter).

Uros.

2023-10-18 18:29:28

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()


> On Oct 18, 2023, at 9:08 PM, Uros Bizjak <[email protected]> wrote:
>
> We can use alias to __seg_gs non-const memory, so the value can be
> accessed without asm. __barrier() will then force reload. Please note
> that any memory clobber, hidden inside asm will also force reload.

For the record, at the time I tried to find a creative solution to
have fine granularity barrier control. I looked into address-spaces,
pure and const function attributes, and use of the restrict keyword.

I did not find a better solution. The behavior of most of these
mechanisms was non-intuitive for me and inconsistent across compilers.

You may succeed where I have failed as you seem more familiar with
the compiler code, but be aware it is a rabbit hole...

2023-10-18 19:34:01

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 8:26 PM Uros Bizjak <[email protected]> wrote:
>
> On Wed, Oct 18, 2023 at 8:16 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Wed, 18 Oct 2023 at 11:08, Uros Bizjak <[email protected]> wrote:
> > >
> > > But loads from non-const memory work like the above.
> >
> > Yes, I'm certainly ok with the move to use plain loads from __seg_gs
> > for the percpu accesses. If they didn't honor the memory clobber, we
> > could never use it at all.
> >
> > I was just saying that the 'const' alias trick isn't useful for
> > anything else than 'current', because everything else needs to at
> > least honor our existing barriers.
>
> FYI, smp_processor_id() is implemented as:
>
> #define __smp_processor_id() __this_cpu_read(pcpu_hot.cpu_number)
>
> where __this_* forces volatile access which disables CSE.
>
> *If* the variable is really stable, then it should use __raw_cpu_read.
> Both, __raw_* and __this_* were recently (tip/percpu branch)
> implemented for SEG_SUPPORT as:

This patch works for me:

--cut here--
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 4fab2ed454f3..6eda4748bf64 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -141,8 +141,7 @@ __visible void
smp_call_function_single_interrupt(struct pt_regs *r);
* This function is needed by all SMP systems. It must _always_ be valid
* from the initial startup.
*/
-#define raw_smp_processor_id() this_cpu_read(pcpu_hot.cpu_number)
-#define __smp_processor_id() __this_cpu_read(pcpu_hot.cpu_number)
+#define raw_smp_processor_id() raw_cpu_read(pcpu_hot.cpu_number)

#ifdef CONFIG_X86_32
extern int safe_smp_processor_id(void);
--cut here--

But it removes merely 10 reads out of 3219.

BTW: I also don't understand the comment from include/linux/smp.h:

/*
* Allow the architecture to differentiate between a stable and unstable read.
* For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a
* regular asm read for the stable.
*/
#ifndef __smp_processor_id
#define __smp_processor_id(x) raw_smp_processor_id(x)
#endif

All reads up to word size on x86 are atomic, so IRQ safe. asm-volatile
is not some IRQ property, but prevents the compiler from CSEing the asm
and from scheduling (moving) the asm around too much.

Uros.

2023-10-18 20:17:51

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()



> On Oct 18, 2023, at 10:33 PM, Uros Bizjak <[email protected]> wrote:
>
> This patch works for me:
>
> --cut here--
> diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
> index 4fab2ed454f3..6eda4748bf64 100644
> --- a/arch/x86/include/asm/smp.h
> +++ b/arch/x86/include/asm/smp.h
> @@ -141,8 +141,7 @@ __visible void
> smp_call_function_single_interrupt(struct pt_regs *r);
> * This function is needed by all SMP systems. It must _always_ be valid
> * from the initial startup.
> */
> -#define raw_smp_processor_id() this_cpu_read(pcpu_hot.cpu_number)
> -#define __smp_processor_id() __this_cpu_read(pcpu_hot.cpu_number)
> +#define raw_smp_processor_id() raw_cpu_read(pcpu_hot.cpu_number)

I don’t think that’s correct. IIUC, while changing __smp_processor_id()
to read pcpu_hot.cpu_number through raw_cpu_read may be fine,
raw_smp_processor_id() should not be changed in this manner.

raw_smp_processor_id() cannot assume that preemption is disabled or that
the task cannot migrate to another core. So the “volatile” keyword and
inline assembly are used to ensure the ordering, and that the value is
read without some compiler optimization reading it multiple times or
tearing the read.

In contrast raw_cpu_read() does not use the volatile keyword, so it
does not provide the same guarantees.

2023-10-18 20:22:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 18 Oct 2023 at 12:33, Uros Bizjak <[email protected]> wrote:
>
> This patch works for me:

Looks fine.

But you actually bring up another issue:

> BTW: I also don't understand the comment from include/linux/smp.h:
>
> /*
> * Allow the architecture to differentiate between a stable and unstable read.
> * For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a
> * regular asm read for the stable.

I think the comment is badly worded, but I think the issue may actually be real.

One word: rematerialization.

The thing is, turning inline asm accesses to regular compiler loads
has a *very* bad semantic problem: the compiler may now feel like it
can not only combine the loads (ok), but also possibly rematerialize
values by re-doing the loads (NOT OK!).

IOW, the kernel often has very strict requirements of "at most once"
behavior, because doing two loads might give different results.

The cpu number is a good example of this.

And yes, sometimes we use actual volatile accesses for them
(READ_ONCE() and WRITE_ONCE()) but those are *horrendous* in general,
and are much too strict. Not only does gcc generally lose its mind
when it sees volatile (ie it stops doing various sane combinations
that would actually be perfectly valid), but it obviously also stops
doing CSE on the loads (as it has to).

So the "non-volatile asm" has been a great way to get the "at most
one" behavior: it's safe wrt interrupts changing the value, because
you will see *one* value, not two. As far as we know, gcc never
rematerializes the output of an inline asm. So when you use an inline
asm, you may have the result CSE'd, but you'll never see it generate
more than *one* copy of the inline asm.

(Of course, as with so much about inline asm, that "knowledge" is not
necessarily explicitly spelled out anywhere, and it's just "that's how
it has always worked").

IOW, look at code like the one in swiotlb_pool_find_slots(), which does this:

int start = raw_smp_processor_id() & (pool->nareas - 1);

and the use of 'start' really is meant to be just a good heuristic, in
that different concurrent CPU's will start looking in different pools.
So that code is basically "cpu-local by default", but it's purely
about locality, it's not some kind of correctness issue, and it's not
necessarily run when the code is *tied* to a particular CPU.

But what *is* important is that 'start' have *one* value, and one
value only. So look at that loop, which basically does

do {
.. use the 'i' based on 'start' ..
if (++i >= pool->nareas)
i = 0;
} while (i != start);

and it is very important indeed that the compiler does *not* think
"Oh, I can rematerialize the 'start' value".

See what I'm saying? Using 'volatile' for loading the current CPU
value would be bad for performance for no good reason. But loading it
multiple times would be a *bug*.

Using inline asm is basically perfect here: the compiler can *combine*
two inline asms into one, but once we have a value for 'start', it
won't change, because the compiler is not going to decide "I can drop
this value, and just re-do the inline asm to rematerialize it".

This all makes me worried about the __seg_fs thing.

For 'current', this is all perfect. Rematerializing current is
actually better than spilling and reloading the value.

But for something like raw_smp_processor_id(), rematerializing would
be a correctness problem, and a really horrible one (because in
practice, the code would work 99.9999% of the time, and then once in a
blue moon, it would rematerialize a different value).

See the problem?
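
To make that concrete, here's a minimal sketch of the shape of that
loop - 'area_is_free' and the pool details are just stand-ins, not the
real swiotlb code:

static int find_area(struct io_tlb_pool *pool)
{
	/* read the CPU number once; 'start' must keep that one value */
	int start = raw_smp_processor_id() & (pool->nareas - 1);
	int i = start;

	do {
		if (area_is_free(pool, i))	/* stand-in helper */
			return i;
		if (++i >= pool->nareas)
			i = 0;
	} while (i != start);

	/*
	 * If the compiler were to rematerialize 'start', the exit test
	 * above would effectively become
	 *
	 *	i != (raw_smp_processor_id() & (pool->nareas - 1))
	 *
	 * and a migration in the middle would make it compare against
	 * a different CPU's value.
	 */
	return -1;
}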

I guess we could use the stdatomics to try to explain these issues to
the compiler, but I don't even know what the C interfaces look like or
whether they are stable and usable across the range of compilers we
use.

Linus

2023-10-18 20:35:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 18 Oct 2023 at 13:22, Linus Torvalds
<[email protected]> wrote:
>
> And yes, sometimes we use actual volatile accesses for them
> (READ_ONCE() and WRITE_ONCE()) but those are *horrendous* in general,
> and are much too strict. Not only does gcc generally lose its mind
> when it sees volatile (ie it stops doing various sane combinations
> that would actually be perfectly valid), but it obviously also stops
> doing CSE on the loads (as it has to).

Note, in case you wonder what I mean by "lose its mind", try this
(extremely stupid) test program:

void a(volatile int *i) { ++*i; }
void b(int *i) { ++*i; }

and note that the non-volatile version does

addl $1, (%rdi)

but the volatile version then refuses to combine the read+write into a
rmw instruction, and generates

movl (%rdi), %eax
addl $1, %eax
movl %eax, (%rdi)

instead.

Sure, it's correct, but it's an example of how 'volatile' ends up
disabling a lot of other optimizations than just the "don't remove the
access".

Doing the volatile as one rmw instruction would still have been very
obviously valid - it's still doing a read and a write. You don't need
two instructions for that.

I'm not complaining, and I understand *why* it happens - compiler
writers very understandably go "oh, I'm not touching that".

I'm just trying to point out that volatile really screws up code
generation even aside from the "access _exactly_ once" issue.

So using inline asm and relying on gcc doing (minimal) CSE will then
generate better code than volatile ever could, even when we just use a
simple 'mov" instruction. At least you get that basic combining
effect, even if it's not great.

And for memory ops, *not* using volatile is dangerous when they aren't stable.

Linus

2023-10-18 20:43:16

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 10:22 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 18 Oct 2023 at 12:33, Uros Bizjak <[email protected]> wrote:
> >
> > This patch works for me:
>
> Looks fine.
>
> But you actually bring up another issue:
>
> > BTW: I also don't understand the comment from include/linux/smp.h:
> >
> > /*
> > * Allow the architecture to differentiate between a stable and unstable read.
> > * For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a
> > * regular asm read for the stable.
>
> I think the comment is badly worded, but I think the issue may actually be real.
>
> One word: rematerialization.
>
> The thing is, turning inline asm accesses to regular compiler loads
> has a *very* bad semantic problem: the compiler may now feel like it
> can not only combine the loads (ok), but also possibly rematerialize
> values by re-doing the loads (NOT OK!).
>
> IOW, the kernel often has very strict requirements of "at most once"
> behavior, because doing two loads might give different results.
>
> The cpu number is a good example of this.
>
> And yes, sometimes we use actual volatile accesses for them
> (READ_ONCE() and WRITE_ONCE()) but those are *horrendous* in general,
> and are much too strict. Not only does gcc generally lose its mind
> when it sees volatile (ie it stops doing various sane combinations
> that would actually be perfectly valid), but it obviously also stops
> doing CSE on the loads (as it has to).
>
> So the "non-volatile asm" has been a great way to get the "at most
> one" behavior: it's safe wrt interrupts changing the value, because
> you will see *one* value, not two. As far as we know, gcc never
> rematerializes the output of an inline asm. So when you use an inline
> asm, you may have the result CSE'd, but you'll never see it generate
> more than *one* copy of the inline asm.
>
> (Of course, as with so much about inline asm, that "knowledge" is not
> necessarily explicitly spelled out anywhere, and it's just "that's how
> it has always worked").
>
> IOW, look at code like the one in swiotlb_pool_find_slots(), which does this:
>
> int start = raw_smp_processor_id() & (pool->nareas - 1);
>
> and the use of 'start' really is meant to be just a good heuristic, in
> that different concurrent CPU's will start looking in different pools.
> So that code is basically "cpu-local by default", but it's purely
> about locality, it's not some kind of correctness issue, and it's not
> necessarily run when the code is *tied* to a particular CPU.
>
> But what *is* important is that 'start' have *one* value, and one
> value only. So look at that loop, which basically does
>
> do {
> .. use the 'i' based on 'start' ..
> if (++i >= pool->nareas)
> i = 0;
> } while (i != start);
>
> and it is very important indeed that the compiler does *not* think
> "Oh, I can rematerialize the 'start' value".
>
> See what I'm saying? Using 'volatile' for loading the current CPU
> value would be bad for performance for no good reason. But loading it
> multiple times would be a *bug*.
>
> Using inline asm is basically perfect here: the compiler can *combine*
> two inline asms into one, but once we have a value for 'start', it
> won't change, because the compiler is not going to decide "I can drop
> this value, and just re-do the inline asm to rematerialize it".
>
> This all makes me worried about the __seg_fs thing.

Please note that there is a difference between this_* and raw_*
accessors. this_* has "volatile" qualification, and for sure it won't
ever be rematerialized (this would defeat the purpose of "volatile").
Previously, this_* was defined as asm-volatile, and now it is defined
as a volatile memory access. GCC will disable almost all optimizations
when volatile memory is accessed. IIRC, volatile-asm also won't be
combined, since the compiler does not know about secondary effects of
asm.

Side note: The raw_smp_processor_id() uses this_, so it has volatile
qualification. Perhaps we can do

+#define __smp_processor_id() raw_cpu_read(pcpu_hot.cpu_number)

as this seems like the relaxed version of smp_processor_id().

So, guarantees of asm-volatile are the same as guarantees of volatile
memory access.

Uros.
>
> For 'current', this is all perfect. Rematerializing current is
> actually better than spilling and reloading the value.
>
> But for something like raw_smp_processor_id(), rematerializing would
> be a correctness problem, and a really horrible one (because in
> practice, the code would work 99.9999% of the time, and then once in a
> blue moon, it would rematerialize a different value).
>
> See the problem?
>
> I guess we could use the stdatomics to try to explain these issues to
> the compiler, but I don't even know what the C interfaces look like or
> whether they are stable and usable across the range of compilers we
> use.
>
> Linus

2023-10-18 20:52:39

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 10:34 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 18 Oct 2023 at 13:22, Linus Torvalds
> <[email protected]> wrote:
> >
> > And yes, sometimes we use actual volatile accesses for them
> > (READ_ONCE() and WRITE_ONCE()) but those are *horrendous* in general,
> > and are much too strict. Not only does gcc generally lose its mind
> > when it sees volatile (ie it stops doing various sane combinations
> > that would actually be perfectly valid), but it obviously also stops
> > doing CSE on the loads (as it has to).
>
> Note, in case you wonder what I mean by "lose its mind", try this
> (extremely stupid) test program:
>
> void a(volatile int *i) { ++*i; }
> void b(int *i) { ++*i; }
>
> and note that the non-volatile version does
>
> addl $1, (%rdi)
>
> but the volatile version then refuses to combine the read+write into a
> rmw instruction, and generates
>
> movl (%rdi), %eax
> addl $1, %eax
> movl %eax, (%rdi)
>
> instead.
>
> Sure, it's correct, but it's an example of how 'volatile' ends up
> disabling a lot of other optimizations than just the "don't remove the
> access".
>
> Doing the volatile as one rmw instruction would still have been very
> obviously valid - it's still doing a read and a write. You don't need
> two instructions for that.

FYI: This is the reason RMW instructions in percpu.h are not (blindly)
converted to C ops. They will remain in their (volatile or not) asm
form because of the above reason, and due to the fact that they don't
combine with anything.

> I'm not complaining, and I understand *why* it happens - compiler
> writers very understandably go "oh, I'm not touching that".
>
> I'm just trying to point out that volatile really screws up code
> generation even aside from the "access _exactly_ once" issue.
>
> So using inline asm and relying on gcc doing (minimal) CSE will then
> generate better code than volatile ever could, even when we just use a
> simple 'mov" instruction. At least you get that basic combining
> effect, even if it's not great.

Actually, RMW insns are better written in asm, while simple "mov"
should be converted to (volatile or not) memory access. On x86 "mov"s
from memory (reads) will combine nicely with almost all other
instructions.

> And for memory ops, *not* using volatile is dangerous when they aren't stable.

True.

Uros.

2023-10-18 21:09:47

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 10:51 PM Uros Bizjak <[email protected]> wrote:
>
> On Wed, Oct 18, 2023 at 10:34 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Wed, 18 Oct 2023 at 13:22, Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > And yes, sometimes we use actual volatile accesses for them
> > > (READ_ONCE() and WRITE_ONCE()) but those are *horrendous* in general,
> > > and are much too strict. Not only does gcc generally lose its mind
> > > when it sees volatile (ie it stops doing various sane combinations
> > > that would actually be perfectly valid), but it obviously also stops
> > > doing CSE on the loads (as it has to).
> >
> > Note, in case you wonder what I mean by "lose its mind", try this
> > (extremely stupid) test program:
> >
> > void a(volatile int *i) { ++*i; }
> > void b(int *i) { ++*i; }
> >
> > and note that the non-volatile version does
> >
> > addl $1, (%rdi)
> >
> > but the volatile version then refuses to combine the read+write into a
> > rmw instruction, and generates
> >
> > movl (%rdi), %eax
> > addl $1, %eax
> > movl %eax, (%rdi)
> >
> > instead.
> >
> > Sure, it's correct, but it's an example of how 'volatile' ends up
> > disabling a lot of other optimizations than just the "don't remove the
> > access".
> >
> > Doing the volatile as one rmw instruction would still have been very
> > obviously valid - it's still doing a read and a write. You don't need
> > two instructions for that.
>
> FYI: This is the reason RMW instructions in percpu.h are not (blindly)
> converted to C ops. They will remain in their (volatile or not) asm
> form because of the above reason, and due to the fact that they don't
> combine with anything.
>
> > I'm not complaining, and I understand *why* it happens - compiler
> > writers very understandably go "oh, I'm not touching that".
> >
> > I'm just trying to point out that volatile really screws up code
> > generation even aside from the "access _exactly_ once" issue.
> >
> > So using inline asm and relying on gcc doing (minimal) CSE will then
> > generate better code than volatile ever could, even when we just use a
> > simple 'mov" instruction. At least you get that basic combining
> > effect, even if it's not great.
>
> Actually, RMW insns are better written in asm, while simple "mov"
> should be converted to (volatile or not) memory access. On x86 "mov"s
> from memory (reads) will combine nicely with almost all other
> instructions.

BTW: There was a discussion that GCC should construct RMW instructions
also when the memory location is marked volatile, but there was no
resolution reached. So, the "I'm not touching that" approach remains.
However, GCC *will* combine a volatile read with a follow-up
instruction.
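
E.g., a minimal user-space-style sketch (assuming a compiler with
__seg_gs support; 'v' and 'add_v' are made-up names):

extern __seg_gs volatile int v;

int add_v(int x)
{
	/* the volatile read happens exactly once, but - per the above -
	 * it can still be folded into the add as a memory operand
	 * instead of needing a separate mov */
	return x + v;
}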

Uros.

2023-10-18 21:13:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 18 Oct 2023 at 13:52, Uros Bizjak <[email protected]> wrote:
>
> FYI: This is the reason RMW instructions in percpu.h are not (blindly)
> converted to C ops. They will remain in their (volatile or not) asm
> form because of the above reason, and due to the fact that they don't
> combine with anything.

Well, also because converting them to C would be HORRIBLY BUGGY.

They absolutely have to be a single instruction. We have architectures
that can't do rmw instructions, and then they have to do lots of extra
horrid crud (disable interrupts or whatever) to make a percpu 'add' be
a valid thing.


> Actually, RMW insns are better written in asm, while simple "mov"
> should be converted to (volatile or not) memory access.

No.

This remat issue has convinced me that the *only* thing that should be
converted to a memory access is the "stable" case (which in practice
is mainly just 'current').

Because if you make them 'volatile' memory instructions, then the
simple "mov" inline asm is simply better. It still allows CSE on the
asm (in the "raw" form).

And if you make them memory instructions _without_ the 'volatile', the
memory access is simply buggy until we have some 'nomaterialize'
model.

So the *only* situation where a memory access is better is that
'stable' case. In all other cases they are the same or strictly worse
than 'asm'.

Linus

2023-10-18 21:41:41

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 11:11 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 18 Oct 2023 at 13:52, Uros Bizjak <[email protected]> wrote:
> >
> > FYI: This is the reason RMW instructions in percpu.h are not (blindly)
> > converted to C ops. They will remain in their (volatile or not) asm
> > form because of the above reason, and due to the fact that they don't
> > combine with anything.
>
> Well, also because converting them to C would be HORRIBLY BUGGY.
>
> They absolutely have to be a single instruction. We have architectures
> that can't do rmw instructions, and then they have to do lots of extra
> horrid crud (disable interrupts or whatever) to make a percpu 'add' be
> a valid thing.
>
>
> > Actually, RMW insns are better written in asm, while simple "mov"
> > should be converted to (volatile or not) memory access.
>
> No.
>
> This remat issue has convinced me that the *only* thing that should be
> converted to a memory access is the "stable" case (which in practice
> is mainly just 'current').
>
> Because if you make them 'volatile' memory instructions, then the
> simple "mov" inline asm is simply better. It still allows CSE on the
> asm (in the "raw" form).

The ones in "raw" form are not IRQ safe and these are implemented
without volatile qualifier.

The safe variants are the ones in "this" form. These were implemented as
volatile-asm and are now implemented as volatile reads. They do not
rematerialize, and the number of memory accesses stays the same. They do
not CSE (volatile-asm also doesn't), but they can propagate into
follow-up instructions.

> And if you make them memory instructions _without_ the 'volatile', the
> memory access is simply buggy until we have some 'nomaterialize'
> model.

This is the reason that almost all percpu access is implemented using
this_* accessors. raw_* is a relaxed version without IRQ guarantees
that should be (and is) used in a controlled manner in special
places:

https://elixir.bootlin.com/linux/latest/A/ident/this_cpu_read
https://elixir.bootlin.com/linux/latest/A/ident/raw_cpu_read

>
> So the *only* situation where a memory access is better is that
> 'stable' case. In all other cases they are the same or strictly worse
> than 'asm'.

No, argument propagation is non-existent with the "asm" version.

Uros.

2023-10-18 22:40:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 18 Oct 2023 at 14:40, Uros Bizjak <[email protected]> wrote:
>
> The ones in "raw" form are not IRQ safe and these are implemented
> without volatile qualifier.

You are misreading it.

Both *are* irq safe - on x86.

The difference between "this_cpu_xyz()" and "raw_cpu_xyz()" is that on
*other* architectures, "raw_cpu_xyz():" can be a lot more efficient,
because other architectures may need to do extra work to make the
"this" version be atomic on a particular CPU.

See for example __count_vm_event() vs count_vm_event().

In fact, that particular use isn't even in an interrupt-safe context,
that's an example of literally "I'd rather be fast than correct for
certain statistics that aren't all that important".

The two versions generate the same code on x86, but on other
architectures, __count_vm_event() can be *much* simpler and faster
because it doesn't disable interrupts or do other special things.
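
Roughly, the generic fallbacks have this shape - a simplified sketch
loosely following include/linux/percpu-defs.h, not the literal macros:

/* plain RMW: preemption or an interrupt can land between load and store */
#define raw_cpu_add_sketch(pcp, val)					\
do {									\
	*raw_cpu_ptr(&(pcp)) += (val);					\
} while (0)

/* IRQ-protected RMW: atomic with respect to anything on this CPU */
#define this_cpu_add_sketch(pcp, val)					\
do {									\
	unsigned long __flags;						\
	raw_local_irq_save(__flags);					\
	*raw_cpu_ptr(&(pcp)) += (val);					\
	raw_local_irq_restore(__flags);					\
} while (0)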

But on x86, the whole "interrupt safety" is a complete red herring.
Both of them generate the exact same instruction.

On x86, the "volatile" is actually for a completely different reason:
to avoid too much CSE by the compiler.

See commit b59167ac7baf ("x86/percpu: Fix this_cpu_read()").

In fact, that commit went overboard, and just added "volatile" to
*every* percpu read.

So then people complained about *that*, and PeterZ did commit
0b9ccc0a9b14 ("x86/percpu: Differentiate this_cpu_{}() and
__this_cpu_{}()"), which basically made that "qual or not" be a macro
choice.

And in the process, it now got added to all the RMW ops, that didn't
actually need it or want it in the first place, since they won't be
CSE'd, since they depend on the input.

So that commit basically generalized the whole thing entirely
pointlessly, and caused your current confusion.

End result: we should remove 'volatile' from the RMW ops. It doesn't
do anything on x86. All it does is make us have two subtly different
versions whose difference we don't care about.

End result two: we should make it clear that "this_cpu_read()" vs
"raw_cpu_read()" are *NOT* about interrupts. Even on architectures
where the RMW ops need to have irq protection (so that they are atomic
wrt interrupts also modifying the value), the *READ* operation
obviously has no such issue.

For the raw_cpu_read() vs this_cpu_read() case, the only issue is
whether you can CSE the result.

And in 99% of all cases, you can - and want to - CSE it. But as that
commit b59167ac7baf shows, sometimes you cannot.

Side note: the code that caused that problem is this:

__always_inline void __cyc2ns_read(struct cyc2ns_data *data)
{
int seq, idx;

do {
seq = this_cpu_read(cyc2ns.seq.seqcount.sequence);
...
} while (unlikely(seq != this_cpu_read(cyc2ns.seq.seqcount.sequence)));
}

where the issue is that the this_cpu_read() of that sequence number
needs to be ordered.

Honestly, that code is just buggy and bad. We should never have
"fixed" it by changing the semantics of this_cpu_read() in the first
place.

The problem is that it re-implements its own locking model, and as so
often happens when people do that, they do it completely wrongly.

Look at the *REAL* sequence counter code in <linux/seqlock.h>. Notice
how in raw_read_seqcount_begin() we have

unsigned _seq = __read_seqcount_begin(s);
smp_rmb();

because it actually does the proper barriers. Notice how the garbage
code in __cyc2ns_read() doesn't have them - and how it was buggy as a
result.

(Also notice how this all predates our "we should use load_acquire()
instead of smp_rmb()", but whatever).

IOW, all the "volatiles" in the x86 <asm/percpu.h> file are LITERAL
GARBAGE and should not exist, and are due to a historical mistake.

Linus

2023-10-18 23:07:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, 18 Oct 2023 at 15:40, Linus Torvalds
<[email protected]> wrote:
>
> See for example __count_vm_event() vs count_vm_event().
>
> In fact, that particular use isn't even in an interrupt-safe context,
> that's an example of literally "I'd rather be fast than correct for
> certain statistics that aren't all that important".

.. just to clarify - I don't think the VM statistics code is even
updated from interrupts, but it is still incorrect to do
"raw_cpu_add()" even in just process context, because on architectures
where it results in separate load-op-store instructions, you can get
preempted in the middle, and now your loaded value is some old stale
one. So when you get back, somebody else might have updated the count,
but you'll still end up doing the store using the stale value.

For VM statistics like the BALLOON_MIGRATE, nobody cares. The stats
may be incorrect, but they aren't a correctness issue, and they'll be
in the right ballpark because the race is not generally hit.

So "interrupt safe" here is not necessarily about actual interrupts
themselves directly. You *can* have that too, but it can also be about
just an interrupt causing preemption.

Anyway, again, none of this is relevant on x86, since the
single-instruction rmw percpu sequence is obviously non-interruptible.

The one oddity on x86 is that, because 'xchg' always has an implied
lock, there we *do* have a multi-instruction sequence.

And then - and *ONLY* then - the raw-vs-this matters even on x86:
"raw" just does a "load-store" pair, while "this" does a cmpxchg loop
(the latter of which is safe for both irq use and preemption because
the cmpxchg obviously re-checks the original value).
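
As a rough sketch (not the exact <asm/percpu.h> macros; the names here
are made up), the two flavours look something like this:

        /* "raw": plain read + write, fine only while the operand is stable */
        #define raw_pcpu_xchg(var, nval)                                \
        ({                                                              \
                typeof(var) old__ = raw_cpu_read(var);                  \
                raw_cpu_write(var, nval);                               \
                old__;                                                  \
        })

        /* "this": cmpxchg loop; a racing update makes the cmpxchg fail
         * and the loop retry, so it survives interrupts and preemption */
        #define this_pcpu_xchg(var, nval)                               \
        ({                                                              \
                typeof(var) old__, new__ = (nval);                      \
                do {                                                    \
                        old__ = this_cpu_read(var);                     \
                } while (this_cpu_cmpxchg(var, old__, new__) != old__); \
                old__;                                                  \
        })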

But even in that xchg case, the "volatile" part of the asm is a
complete red herring and shouldn't exist.

Linus

2023-10-19 07:05:02

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 19, 2023 at 12:40 AM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 18 Oct 2023 at 14:40, Uros Bizjak <[email protected]> wrote:
> >
> > The ones in "raw" form are not IRQ safe and these are implemented
> > without volatile qualifier.
>
> You are misreading it.
>
> Both *are* irq safe - on x86.
>
> The difference between "this_cpu_xyz()" and "raw_cpu_xyz()" is that on
> *other* architectures, "raw_cpu_xyz()" can be a lot more efficient,
> because other architectures may need to do extra work to make the
> "this" version be atomic on a particular CPU.
>
> See for example __count_vm_event() vs count_vm_event().
>
> In fact, that particular use isn't even in an interrupt-safe context,
> that's an example of literally "I'd rather be fast than correct for
> certain statistics that aren't all that important".
>
> The two versions generate the same code on x86, but on other
> architectures, __count_vm_event() can be *much* simpler and faster
> because it doesn't disable interrupts or do other special things.
>
> But on x86, the whole "interrupt safety" is a complete red herring.
> Both of them generate the exact same instruction.
>
> On x86, the "volatile" is actually for a completely different reason:
> to avoid too much CSE by the compiler.

Let me explain how the compiler handles volatile. Basically, it
disables most of the optimizations when volatile is encountered, be it
on volatile asm or volatile memory access. It is important that
argument propagation is NOT disabled with volatile memory, so the
compiler can still propagate arguments into the following instruction,
as long as no memory access is crossed. This is all that [1]
implements, while keeping volatile around.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/percpu&id=ca4256348660cb2162668ec3d13d1f921d05374a
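
As a made-up example of what that propagation buys (pcpu_val is a
hypothetical per-CPU variable, and the asm in the comment is only
approximate):

        static DEFINE_PER_CPU(int, pcpu_val);  /* hypothetical per-CPU variable */

        int add_pcpu(int x)
        {
                /*
                 * With the named-address-space (C memory op) implementation the
                 * read can be folded straight into its user, roughly:
                 *      addl %gs:pcpu_val(%rip), %eax
                 * With the asm-based this_cpu_read() it must go through a
                 * register first:
                 *      movl %gs:pcpu_val(%rip), %edx
                 *      addl %edx, %eax
                 */
                return x + this_cpu_read(pcpu_val);
        }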

Remark: you raised a question about why "testl $0xf0000,%gs:0x0(%rip)"
was not optimized to testb - because of the volatile access on memory,
the compiler won't shrink the memory read.

The compiler won't CSE volatile-asm or volatile reads, so the number
of memory accesses stays the same. This is an important property,
promised by the this_* API, and this_* should be used almost everywhere
because of this property.

raw_* is a relaxed version that allows CSE of asm and memory accesses.
The number of memory accesses can change, and additionally, it *can*
rematerialize arguments from memory [2].

[2] If there is an instruction that uses all registers in between the
load from memory and the insn that uses the loaded value, then the
compiler will try to accommodate the instruction by pushing all active
registers to the stack (stack is owned by the compiler). Instead of
pushing and popping the register with the loaded value, it can just as
well re-read it from the original non-volatile location. If the location is
qualified as volatile, this optimization is off.

This is the rematerialization that you are worrying about. Like CSE,
it can be allowed only under special conditions: in a section where it
is *guaranteed* that the value in memory won't change for the whole
section. This can be guaranteed in a (kind of) critical section (e.g.
when interrupts are disabled); otherwise the rematerialized value can
differ, because some interrupt handler changed the value in memory.
The opposite also holds: we will miss a changed value when the read is
CSE'd.
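
A minimal sketch of that hazard (pcpu_var is a made-up stand-in for a
non-volatile per-CPU location; whether the compiler spills or re-reads
is entirely its choice):

        extern int pcpu_var;            /* stands in for a per-CPU variable */

        int remat_hazard(void)
        {
                int val = pcpu_var;     /* one load in the C source */

                /* register pressure, but no "memory" clobber */
                asm volatile ("" : : : "ax", "cx", "dx", "si", "di", "r8", "r9");

                return val;             /* the compiler may spill 'val', or it may
                                         * re-read (rematerialize) pcpu_var here and
                                         * observe a newer value if an interrupt
                                         * changed it in between */
        }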

That means that even when the instruction is the same (and IRQ safe),
"volatile" inhibits unwanted optimizations and should be there for the
compiler's sake, to guarantee the access to memory. Since RMW asms are
accessing memory, the non-relaxed versions should be volatile, too.

Continuing with the raw_* issue: these are relaxed versions that
require stable operands inside a section marked with some barriers
(so, a critical section). Those barriers (e.g. a volatile asm memory
clobber) prevent memory accesses from being scheduled outside this
critical section, and memory operand stability is achieved by e.g.
disabling interrupts. Since the user *guarantees* memory stability,
the compiler can relax its CSE and rematerialization restrictions *for
this section*. Accessors that are implemented without "volatile"
communicate this relaxation to the compiler.

The above also means that when raw_* versions are used on non-stable
operands, it is the user's fault for using them in the first place, not
the compiler's. raw_ versions should be used with great care.

To illustrate the above, let's look at the "xchg" RMW implementation.
The this_* one is interrupt safe in the sense that it is implemented
with a cmpxchg operation that is atomic w.r.t. interrupts (which can
change the contents of the memory) [note, this is not "locked", and
does not have the full atomic property], but raw_* is implemented simply by
raw_ reads and writes. In the raw_ case, we don't care what happens
with memory accesses, because they are required to be stable in the
section where raw_xchg is used.

Then we have this_cpu_read_stable, which has even stricter requirements
on operand stability. It can be considered constant memory, and
now we have the means to communicate this to the compiler.

[Side note: this_cpu_read_stable is a misnomer. It should be named
raw_cpu_read_stable to reflect the operand stability required by raw_
versions.]

From the compiler standpoint, "volatiles" are not optional; they
communicate the operand stability state to the compiler. I have reviewed
percpu.h many times, and the current state of operations is correct,
even if they result in the same instruction.

BTW: About that raw_smp_processor_id():

The raw_ version should stay defined as
this_cpu_read(pcpu_hot.cpu_number), while __smp_processor_id() can be
redefined to raw_cpu_read. Please note the usage in
include/linux/smp.h:

#define get_cpu() ({ preempt_disable(); __smp_processor_id(); })
#define put_cpu() preempt_enable()

where preempt_disable and preempt_enable mark the boundaries of the
critical section.
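
A tiny usage sketch (process_on_cpu() is a made-up helper):

        extern void process_on_cpu(int cpu);    /* hypothetical helper */

        void example(void)
        {
                int cpu = get_cpu();    /* preempt_disable() + __smp_processor_id() */
                process_on_cpu(cpu);    /* cpu stays valid while preemption is off */
                put_cpu();              /* preempt_enable() */
        }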

Uros.

> See commit b59167ac7baf ("x86/percpu: Fix this_cpu_read()").
>
> In fact, that commit went overboard, and just added "volatile" to
> *every* percpu read.
>
> So then people complained about *that*, and PeterZ did commit
> 0b9ccc0a9b14 ("x86/percpu: Differentiate this_cpu_{}() and
> __this_cpu_{}()"), which basically made that "qual or not" be a macro
> choice.
>
> And in the process, it now got added to all the RMW ops, that didn't
> actually need it or want it in the first place, since they won't be
> CSE'd, since they depend on the input.
>
> So that commit basically generalized the whole thing entirely
> pointlessly, and caused your current confusion.
>
> End result: we should remove 'volatile' from the RMW ops. It doesn't
> do anything on x86. All it does is leave us with two subtly different
> versions whose difference we don't care about.
>
> End result two: we should make it clear that "this_cpu_read()" vs
> "raw_cpu_read()" are *NOT* about interrupts. Even on architectures
> where the RMW ops need to have irq protection (so that they are atomic
> wrt interrupts also modifying the value), the *READ* operation
> obviously has no such issue.
>
> For the raw_cpu_read() vs this_cpu_read() case, the only issue is
> whether you can CSE the result.
>
> And in 99% of all cases, you can - and want to - CSE it. But as that
> commit b59167ac7baf shows, sometimes you cannot.
>
> Side note: the code that caused that problem is this:
>
> __always_inline void __cyc2ns_read(struct cyc2ns_data *data)
> {
> int seq, idx;
>
> do {
> seq = this_cpu_read(cyc2ns.seq.seqcount.sequence);
> ...
> } while (unlikely(seq != this_cpu_read(cyc2ns.seq.seqcount.sequence)));
> }
>
> where the issue is that the this_cpu_read() of that sequence number
> needs to be ordered.
>
> Honestly, that code is just buggy and bad. We should never have
> "fixed" it by changing the semantics of this_cpu_read() in the first
> place.
>
> The problem is that it re-implements its own locking model, and as so
> often happens when people do that, they do it completely wrongly.
>
> Look at the *REAL* sequence counter code in <linux/seqlock.h>. Notice
> how in raw_read_seqcount_begin() we have
>
> unsigned _seq = __read_seqcount_begin(s);
> smp_rmb();
>
> because it actually does the proper barriers. Notice how the garbage
> code in __cyc2ns_read() doesn't have them - and how it was buggy as a
> result.
>
> (Also notice how this all predates our "we should use load_acquire()
> instead of smp_rmb()", but whatever).
>
> IOW, all the "volatiles" in the x86 <asm/percpu.h> file are LITERAL
> GARBAGE and should not exist, and are due to a historical mistake.
>
> Linus

2023-10-19 08:47:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 03:40:05PM -0700, Linus Torvalds wrote:

> Side note: the code that caused that problem is this:
>
> __always_inline void __cyc2ns_read(struct cyc2ns_data *data)
> {
> int seq, idx;
>
> do {
> seq = this_cpu_read(cyc2ns.seq.seqcount.sequence);
> ...
> } while (unlikely(seq != this_cpu_read(cyc2ns.seq.seqcount.sequence)));
> }
>
> where the issue is that the this_cpu_read() of that sequence number
> needs to be ordered.

I have very vague memories of other code also relying on this_cpu_read()
implying READ_ONCE().

And that code really is only buggy if you do not have that. Since it is
cpu-local, smp_rmb() would be confusing, as would smp_load_acquire()
-- there is no cross-CPU data ordering.

The other option is of course adding an explicit barrier(), but that's
entirely superfluous when all the loads are READ_ONCE().


If you want to make this_cpu_read() not imply READ_ONCE(), then we
should go audit all users :/ Can be done ofc.

2023-10-19 08:55:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 03:40:05PM -0700, Linus Torvalds wrote:

> Look at the *REAL* sequence counter code in <linux/seqlock.h>. Notice
> how in raw_read_seqcount_begin() we have
>
> unsigned _seq = __read_seqcount_begin(s);
> smp_rmb();
>
> because it actually does the proper barriers. Notice how the garbage
> code in __cyc2ns_read() doesn't have them - and how it was buggy as a
> result.
>
> (Also notice how this all predates our "we should use load_acquire()
> instead of smp_rmb()", but whatever).

seqlock actually wants rmb even today, the pattern is:

do {
seq = load-seq
rmb
load-data
rmb
} while (seq != re-load-seq)

we specifically only care about loads, and the data loads must be
between the sequence number loads.

As such, load-acquire is not a natural match.
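
Spelled out as code, a minimal reader sketch of that pattern (the
struct and field names are illustrative, and the real seqlock code
also retries while a writer holds the count odd):

        struct demo_seq {               /* illustrative, not the seqlock.h types */
                unsigned int sequence;
                unsigned long data;
        };

        unsigned long demo_read(struct demo_seq *s)
        {
                unsigned int seq;
                unsigned long snapshot;

                do {
                        seq = READ_ONCE(s->sequence);
                        smp_rmb();                      /* data loads stay after the seq load */
                        snapshot = READ_ONCE(s->data);
                        smp_rmb();                      /* re-check stays after the data loads */
                } while (seq != READ_ONCE(s->sequence));

                return snapshot;
        }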

2023-10-19 09:09:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 03:40:05PM -0700, Linus Torvalds wrote:
> On Wed, 18 Oct 2023 at 14:40, Uros Bizjak <[email protected]> wrote:
> >
> > The ones in "raw" form are not IRQ safe and these are implemented
> > without volatile qualifier.
>
> You are misreading it.
>
> Both *are* irq safe - on x86.

Stronger, x86 arch code very much relies on them being NMI-safe. Which
makes the generic implementation insufficient.

They *must* be single RmW instructions on x86.

2023-10-19 09:24:09

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 19, 2023 at 11:09 AM Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Oct 18, 2023 at 03:40:05PM -0700, Linus Torvalds wrote:
> > On Wed, 18 Oct 2023 at 14:40, Uros Bizjak <[email protected]> wrote:
> > >
> > > The ones in "raw" form are not IRQ safe and these are implemented
> > > without volatile qualifier.
> >
> > You are misreading it.
> >
> > Both *are* irq safe - on x86.
>
> Stronger, x86 arch code very much relies on them being NMI-safe. Which
> makes the generic implementation insufficient.
>
> They *must* be single RmW instructions on x86.

Maybe I should rephrase my quoted claim above:

"raw" versions are not needed to be IRQ safe [*].

[*] Memory arguments need to be stable, so IRQ and NMI handlers must
not change them while the "raw" version operates inside its critical
section. When memory arguments are stable, the compiler can omit
(cache) reads of the arguments, or re-read them (rematerialize) from
memory. The atomicity of the operation is irrelevant in the "raw"
context, so implementing raw_percpu_xchg_op using raw_cpu_read/write
is OK in this context.
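
A usage sketch of that rule (pcpu_count is a made-up per-CPU variable):

        static DEFINE_PER_CPU(int, pcpu_count);        /* made-up per-CPU variable */

        void mark_once(void)
        {
                unsigned long flags;

                local_irq_save(flags);                  /* memory operand is now stable */
                if (raw_cpu_read(pcpu_count) == 0)      /* relaxed: may be CSE'd/re-read */
                        raw_cpu_write(pcpu_count, 1);
                local_irq_restore(flags);               /* end of the critical section */
        }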

Uros.

2023-10-19 16:32:56

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Wed, Oct 18, 2023 at 10:22 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 18 Oct 2023 at 12:33, Uros Bizjak <[email protected]> wrote:
> >
> > This pach works for me:
>
> Looks fine.
>
> But you actually bring up another issue:
>
> > BTW: I also don't understand the comment from include/linux/smp.h:
> >
> > /*
> > * Allow the architecture to differentiate between a stable and unstable read.
> > * For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a
> > * regular asm read for the stable.
>
> I think the comment is badly worded, but I think the issue may actually be real.
>
> One word: rematerialization.
>
> The thing is, turning inline asm accesses to regular compiler loads
> has a *very* bad semantic problem: the compiler may now feel like it
> can not only combine the loads (ok), but also possibly rematerialize
> values by re-doing the loads (NOT OK!).
>
> IOW, the kernel often has very strict requirements of "at most once"
> behavior, because doing two loads might give different results.
>
> The cpu number is a good example of this.
>
> And yes, sometimes we use actual volatile accesses for them
> (READ_ONCE() and WRITE_ONCE()) but those are *horrendous* in general,
> and are much too strict. Not only does gcc generally lose its mind
> when it sees volatile (ie it stops doing various sane combinations
> that would actually be perfectly valid), but it obviously also stops
> doing CSE on the loads (as it has to).
>
> So the "non-volatile asm" has been a great way to get the "at most
> one" behavior: it's safe wrt interrupts changing the value, because
> you will see *one* value, not two. As far as we know, gcc never
> rematerializes the output of an inline asm. So when you use an inline
> asm, you may have the result CSE'd, but you'll never see it generate
> more than *one* copy of the inline asm.
>
> (Of course, as with so much about inline asm, that "knowledge" is not
> necessarily explicitly spelled out anywhere, and it's just "that's how
> it has always worked").

Perhaps you will be interested in chapter 6.47.2.1, "Volatile", of the
GCC manual, which says:

" Under certain circumstances, GCC may duplicate (or remove duplicates
of) your assembly code when optimizing."

The compiler may re-materialize non-volatile asm in the same way it
may re-materialize arguments from non-volatile memory. To avoid this,
volatile asm is necessary when unstable memory arguments are accessed
using this_* variants.

Uros.

2023-10-19 17:00:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 19 Oct 2023 at 00:04, Uros Bizjak <[email protected]> wrote:
>
> Let me explain how the compiler handles volatile.

We're talking past each other.

You are talking about the volatile *memory* ops, and the
difference that "raw" vs "this" would cause with and without the
"volatile".

While *I* am now convinced that the memory ops aren't even an option,
because they will generate worse code, because pretty much all users
use the "this" version (which would have to use volatile),

Because if we just stick with inline asms, the need for "volatile"
simply goes away.

The existing volatile on those percpu inline asms is *wrong*. It's a
historical mistake.

And with just a plain non-volatile inline asm, the inline asm wins.

It doesn't have the (bad) read-once behavior of a volatile memory op.

And it also doesn't have the (horrible correctness issue)
rematerialization behavior of a non-volatile memory op.

A compiler that were to rematerialize an inline asm (instead of
spilling) would be a bad joke. That's not an optimization, that's just
a crazy bad compiler with a code generation bug.

Linus

2023-10-19 17:06:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 19 Oct 2023 at 01:54, Peter Zijlstra <[email protected]> wrote:
>
> seqlock actually wants rmb even today, the pattern is:

That's the pattern today, and yes, it makes superficial sense, but no,
it's not a "really wants" issue.

> do {
> seq = load-seq
> rmb
> load-data
> rmb
> } while (seq != re-load-seq)
>
> we specifically only care about loads, and the data loads must be
> between the sequence number loads.
>
> As such, load-acquire is not a natural match.

You are correct that "rmb" _sounds_ logical. We do, after all, want to
order reads wrt each other.

So rmb is the obvious choice.

But the thing is, "read_acquire" actually does that too.

So if you do

seq = load_acquire(orig_seq);
load-data

then that acquire actually makes that first 'rmb' pointless. Acquire
already guarantees that all subsequent memory operations are ordered
wrt that read.

And 'acquire' is likely faster than 'rmb' on sane modern architectures.

On x86 it doesn't matter (rmb is a no-op, and all loads are acquires).

But on arm64, for example, you can do a 'ld.acq' in one instruction
and you're done - while a rmb then ends up being a barrier (ok, the
asm mnemonics are horrible: it's not "ld.acq", it's "ldar", but
whatever - I like arm64 as an architecture, but I think they made the
standard assembly syntax pointlessly and actively hostile to humans).

Of course then microarchitectures may end up doing basically the same
thing, but at least technically the 'load acquire' is likely more
targeted and more optimized.

The second rmb is then harder to change, and that is going to stay an
rmb (you could say "do an acquire on the last data load", but that
doesn't fit the sane locking semantics of a sequence lock).

But I do think our sequence counters would be better off using
"smp_load_acquire()" for that initial read.

Of course, then the percpu case doesn't care about the SMP ordering,
but it should still use an UP barrier to make sure things don't get
re-ordered. Relying on our "percpu_read()" ordering other reads around
it is *wrong*.

Linus

2023-10-19 17:09:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 19 Oct 2023 at 09:32, Uros Bizjak <[email protected]> wrote:
>
> Perhaps you will be interested in chapter 6.47.2.1, "Volatile", of the
> GCC manual, which says:
>
> " Under certain circumstances, GCC may duplicate (or remove duplicates
> of) your assembly code when optimizing."
>
> The compiler may re-materialize non-volatile asm in the same way it
> may re-materialize arguments from non-volatile memory. To avoid this,
> volatile asm is necessary when unstable memory arguments are accessed
> using this_* variants.

That's disgusting. The whole (and only) point of rematerialization is
as an optimization. When gcc doesn't know how expensive an inline asm
is (and they can be very expensive indeed), doing remat on it would be
an obvious mistake.

I think what you say is "we're technically allowed to do it, but we'd
be crazy to *actually* do it".

The kernel does require a sane compiler. We will turn off options that
make it do insane things - even when said insane things are allowed by
a bad standard.

Do you actually have gcc code that rematerializes an inline asm?

Linus

2023-10-19 17:21:50

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 19, 2023 at 7:00 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, 19 Oct 2023 at 00:04, Uros Bizjak <[email protected]> wrote:
> >
> > Let me explain how the compiler handles volatile.
>
> We're talking past each other.
>
> You are talking about the volatile *memory* ops, and the
> difference that "raw" vs "this" would cause with and without the
> "volatile".
>
> While *I* am now convinced that the memory ops aren't even an option,
> because they will generate worse code, because pretty much all users
> use the "this" version (which would have to use volatile),

Please see [1]. Even with volatile access, with memory ops the
compiler can propagate operands, resulting in an ~8k code size reduction
and many hundreds (if not thousands) of MOVs propagated into subsequent
instructions. Please note the many code examples in [1]. This is not
possible with the asm variant.

[1] https://lore.kernel.org/lkml/[email protected]/

> Because if we just stick with inline asms, the need for "volatile"
> simply goes away.

No, the compiler is then free to remove or duplicate the asm (plus
other unwanted optimizations); please see the end of chapter 6.47.2.1
in [2].

[2] https://gcc.gnu.org/onlinedocs/gcc-13.2.0/gcc/Extended-Asm.html#Volatile-1

> The existing volatile on those percpu inline asms is *wrong*. It's a
> historical mistake.

Please see above.

> And with just a plain non-volatile inline asm, the inline asm wins.

Please see [1] for the code propagation argument.

> It doesn't have the (bad) read-once behavior of a volatile memory op.
>
> And it also doesn't have the (horrible correctness issue)
> rematerialization behavior of a non-volatile memory op.

Unfortunately, it does. Without volatile, an asm can be rematerialized in
the same way as it can be CSEd. OTOH, the memory op in the memory-ops
approach is cast to volatile in the this_* case, so it for sure won't
get rematerialized.

> A compiler that were to rematerialize an inline asm (instead of
> spilling) would be a bad joke. That's not an optimization, that's just
> a crazy bad compiler with a code generation bug.

But that is what the compiler does without volatile.

Thanks,
Uros.

2023-10-19 18:07:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 19 Oct 2023 at 10:21, Uros Bizjak <[email protected]> wrote:
>
> > A compiler that were to rematerialize an inline asm (instead of
> > spilling) would be a bad joke. That's not an optimization, that's just
> > a crazy bad compiler with a code generation bug.
>
> But that is what the compiler does without volatile.

Do you actually have a real case of that, or are you basing it purely off
insane documentation?

Because remat of inline asm really _is_ insane.

Linus

2023-10-19 18:14:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 19, 2023 at 10:04:56AM -0700, Linus Torvalds wrote:

> So if you do
>
> seq = load_acquire(orig_seq);
> load-data
>
> then that acquire actually makes that first 'rmb' pointless. Acquire
> already guarantees that all subsequent memory operations are ordered
> wrt that read.
>
> And 'acquire' is likely faster than 'rmb' on sane modern architectures.
>
> On x86 it doesn't matter (rmb is a no-op, and all loads are acquires).
>
> But on arm64, for example, you can do a 'ld.acq' in one instruction
> and you're done - while a rmb then ends up being a barrier (ok, the
> asm mnemonics are horrible: it's not "ld.acq", it's "ldar", but
> whatever - I like arm64 as an architecture, but I think they made the
> standard assembly syntax pointlessly and actively hostile to humans).
>
> Of course then microarchitectures may end up doing basically the same
> thing, but at least technically the 'load acquire' is likely more
> targeted and more optimized.

Sure, acquire should work fine here.

> The second rmb is then harder to change, and that is going to stay an
> rmb ( you could say "do an acquire on the last data load, but that
> doesn't fit the sane locking semantics of a sequence lock).

Wouldn't even work, acquire allows an earlier load to pass it. It only
constrains later loads to not happen before it.

> Of course, then the percpu case doesn't care about the SMP ordering,
> but it should still use an UP barrier to make sure things don't get
> re-ordered. Relying on our "percpu_read()" ordering other reads around
> it is *wrong*.

I'm happy to put barrier() in there if it makes you feel better.

But are you really saying this_cpu_read() should not imply READ_ONCE()?

If so, we should probably go audit a ton of code :/

2023-10-19 18:17:29

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 19, 2023 at 8:06 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, 19 Oct 2023 at 10:21, Uros Bizjak <[email protected]> wrote:
> >
> > > A compiler that were to rematerialize an inline asm (instead of
> > > spilling) would be a bad joke. That's not an optimization, that's just
> > > a crazy bad compiler with a code generation bug.
> >
> > But that is what the compiler does without volatile.
>
> Do you actually have a real case of that, or are you basing it purely off
> insane documentation?
>
> Because remat of inline asm really _is_ insane.

The following testcase pushes the compiler to the limit:

--cut here--
extern void ex (int);

static int read (void)
{
        int ret;

        asm ("# -> %0" : "=r"(ret));
        return ret;
}

int foo (void)
{
        int ret = read ();

        ex (ret);
        asm volatile ("clobber" : : : "ax", "cx", "dx", "bx", "bp", "si", "di");

        return ret;
}

extern int m;

int bar (void)
{
        int ret = m;

        ex (ret);
        asm volatile ("clobber" : : : "ax", "cx", "dx", "bx", "bp", "si", "di");

        return ret;
}
--cut here--

Please compile the above with -S -O2 -m32 (so we don't have to list
all 16 x86_64 registers).

And NO (whee...), there is no rematerialization of the asm (foo()). OTOH,
there is also no rematerialization from non-volatile memory (bar()),
although it would be more optimal than a spill-to/fill-from-frame pair
of moves. I wonder what the "certain circumstances" are that the
documentation is referring to.

Uros.



2023-10-19 18:23:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 19 Oct 2023 at 11:14, Peter Zijlstra <[email protected]> wrote:
>
> But are you really saying this_cpu_read() should not imply READ_ONCE()?

Well, Uros is saying that we may be *forced* to have that implication,
much as I really hate it (and wonder at the competence of a compiler
that forces the code-pessimizing 'volatile').

And the "it's not volatile" is actually our historical behavior. The
volatile really is new, and didn't exist before your commit
b59167ac7baf ("x86/percpu: Fix this_cpu_read()").

So the whole "implies READ_ONCE()" really seems to be due to that
*one* mistake in our percpu sequence locking code.

Yes, it's been that way for 5 years now, but it was the other way
around for the preceding decade....

Linus

2023-10-19 18:38:04

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 19, 2023 at 8:22 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, 19 Oct 2023 at 11:14, Peter Zijlstra <[email protected]> wrote:
> >
> > But are you really saying this_cpu_read() should not imply READ_ONCE()?
>
> Well, Uros is saying that we may be *forced* to have that implication,
> much as I really hate it (and wonder at the competence of a compiler
> that forces the code-pessimizing 'volatile').

Please note that my patch mitigates exactly this. The propagation of
volatile(!) arguments allows huge instruction and code savings. By
using non-volatile asm, a very limited BB CSE can perhaps remove a few
asms. However, if there is no READ_ONCE requirement, then we can
simply remove "volatile" qualification for this_cpu_read from the
memory-ops patch. It will be like a field trip for the compiler,
because *then* it will be able to optimize everything without
limitations.

Uros.

> And the "it's not volatile" is actually our historical behavior. The
> volatile really is new, and didn't exist before your commit
> b59167ac7baf ("x86/percpu: Fix this_cpu_read()").
>
> So the whole "implies READ_ONCE()" really seems to be due to that
> *one* mistake in our percpu sequence locking code.
>
> Yes, it's been that way for 5 years now, but it was the other way
> around for the preceding decade....
>
> Linus

2023-10-19 18:50:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 19 Oct 2023 at 11:16, Uros Bizjak <[email protected]> wrote:
>
> And NO (whee...), there is no rematerialization of asm (foo() ). OTOH,
> there is also no rematerialization from non-volatile memory (bar() ),
> although it would be more optimal than spill to/fill from a frame pair
> of moves. I wonder what are "certain circumstances" that the
> documentation is referring to.

Honestly, I've actually never seen gcc rematerialize anything at all.

I really only started worrying about remat issues in a theoretical
sense, and because I feel it would be relatively *easy* to do for
something where the source is a load.

Linus

2023-10-19 19:08:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 19 Oct 2023 at 11:49, Linus Torvalds
<[email protected]> wrote:
>
> Honestly, I've actually never seen gcc rematerialize anything at all.
>
> I really only started worrying about remat issues in a theoretical
> sense, and because I feel it would be relatively *easy* to do for
> something where the source is a load.

.. I started looking around, since I actually have gcc sources around.

At least lra-remat.cc explicitly says

o no any memory (as access to memory is non-profitable)

so if we could just *rely* on that, it would actually allow us to use
memory ops without the volatile.

That would be the best of all worlds, of course.

I do have clang sources too, but I've looked at gcc enough that I at
least can do the "grep and look for patterns" and tend to have an idea
of what the passes are. Clang, not so much.

From my "monkey see patterns" check, it does look like clang mainly
just rematerializes immediates (and some address generation), but also
memory accesses without a base register (allowing a "base register" of
(%rip)).

See X86InstrInfo::isReallyTriviallyReMaterializable() in case anybody cares.

So it does look like clang might actually rematerialize exactly percpu
loads with a constant address, and my "those are easy to
rematerialize" worry may have been correct.

HOWEVER, I'd like to once again say that I know so little about llvm
that my "monkey with 'grep' and some pattern matching ability" thing
really means that I'm just guessing.

Now, PeterZ is obviously worried about even just CSE and re-ordering,
so remat isn't the *only* issue. I do agree that we've had 'volatile'
on many of the asms possibly hiding any issues for the last five
years.

Linus

2023-10-19 21:07:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, 19 Oct 2023 at 11:16, Uros Bizjak <[email protected]> wrote:
>
> I wonder what are "certain circumstances" that the
> documentation is referring to.

Looking more at that "under certain circumstances" statement, I
actually think it refers even to the situation *with* "asm volatile".

In particular, when doing loop unrolling, gcc will obviously duplicate
the asm (both with and without volatile). That would obviously lead to
exactly the kinds of problems that snippet of documentation then talks
about:

"This can lead to unexpected duplicate symbol errors during
compilation if your asm code defines symbols or labels"

so that makes complete sense. It also matches up with the fact that
this is all actually documented very much under the "volatile" label -
ie this is a generic thing that happens even *with* volatile in place,
and we should not expect that "one asm statement" will generate
exactly one copy of the resulting assembler.

It also matches up with the earlier passage about "Note that
the compiler can move even volatile asm instructions relative to other
code, including across jump instructions". So I think what happened is
exactly that somebody was declaring a variable or local label inside
the asm, and then the docs were clarified to state that the asm can be
duplicated in the output.
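
A made-up example of that failure mode: if the loop below gets unrolled
(e.g. with -funroll-all-loops), the asm body - and with it the label -
is duplicated, and assembly then fails with something like "symbol
'my_marker' is already defined":

        void emit_marker(void)
        {
                for (int i = 0; i < 4; i++)
                        asm volatile ("my_marker: nop");
        }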

Of course, this is all just by me reading the docs and looking at gcc
output for way too many years. It's not based on any knowledge of the
original issue.

Linus

2023-10-19 22:39:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

Unrelated question to the gcc people (well, related in the way that
this discussion made me *test* this).

Lookie here:

int test(void)
{
        unsigned int sum = 0;
        for (int i = 0; i < 4; i++) {
                unsigned int val;
#if ONE
                asm("magic1 %0":"=r" (val): :"memory");
#else
                asm volatile("magic2 %0":"=r" (val));
#endif
                sum += val;
        }
        return sum;
}

and now build this with

gcc -O2 -S -DONE -funroll-all-loops t.c

and I get a *completely* nonsensical end result. What gcc generates is
literally insane.

What I *expected* to happen was that the two cases (with "-DONE" and
without) would generate the same code, since one has an "asm volatile",
and the other has a memory clobber.

IOW, neither really should be something that can be combined.

But no. The '-DONE" version is completely crazy with my gcc-13.2.1 setup.

First off, it does actually CSE all the asms despite the memory
clobber. Which I find quite debatable, but whatever.

But not only does it CSE them, it then does *not* just multiply the
result by four. No. It generates this insanity:

magic1 %eax
movl %eax, %edx
addl %eax, %eax
addl %edx, %eax
addl %edx, %eax
ret

so it has apparently done the CSE _after_ the other optimizations.

Very strange.

Honestly, the CSE part looks like an obvious bug to me. The gcc
documentation states:

The "memory" clobber tells the compiler that the assembly code
performs memory reads or writes to items other than those listed in
the input and output operands (for example, accessing the memory
pointed to by one of the input parameters).

so CSE'ing any inline asm with a memory clobber sounds *very* dubious.
The asm literally told the compiler that it has side effects in
unrelated memory locations!

I don't think we actually care in the kernel (and yes, I think it
would always be safer to use "asm volatile" if there are unrelated
memory locations that change), but since I was testing this and was
surprised, and since the obvious reading of the documented behavior of
a memory clobber really does scream "you can't combine those asms", I
thought I'd mention this.

Also, *without* the memory clobber, gcc obviously still does CSE the
asm, but also, gcc ends up doing just

magic1 %eax
sall $2, %eax
ret

so the memory clobber clearly does actually make a difference. Just
not a _sane_ one.

In testing, clang does *not* have this apparently buggy behavior (but
clang annoyingly actually checks the instruction mnemonics, so I had
to change "magic" into "strl" instead to make clang happy).

Hmm?

Linus

2023-10-20 07:59:11

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Thu, Oct 19, 2023 at 9:07 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, 19 Oct 2023 at 11:49, Linus Torvalds
> <[email protected]> wrote:
> >
> > Honestly, I've actually never seen gcc rematerialize anything at all.
> >
> > I really only started worrying about remat issues in a theoretical
> > sense, and because I feel it would be relatively *easy* to do for
> > something where the source is a load.
>
> .. I started looking around, since I actually have gcc sources around.
>
> At least lra-remat.cc explicitly says
>
> o no any memory (as access to memory is non-profitable)
>
> so if we could just *rely* on that, it would actually allow us to use
> memory ops without the volatile.
>
> That would be the best of all worlds, of course.

I have made an experiment and changed:

#define __raw_cpu_read(qual, pcp) \
({ \
- *(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)); \
+ *(__my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)); \
})

#define __raw_cpu_write(qual, pcp, val) \
do { \
- *(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)) = (val); \
+ *(__my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)) = (val); \
} while (0)

Basically, I removed "volatile" from the read/write accessors. With all
the new percpu patches in place, the difference in all percpu accesses is:

Reference: 15990 accesses
Patched: 15976 accesses.

So, the difference is 14 fewer accesses. Waaay too low of a gain for a
potential pain.

The code size savings are:

text data bss dec hex filename
25476129 4389468 808452 30674049 1d40c81 vmlinux-new.o
25476021 4389444 808452 30673917 1d40bfd vmlinux-ref.o

So, 108 bytes for the default build.

Uros.

2023-10-20 08:09:04

by Uros Bizjak

[permalink] [raw]
Subject: Re: [PATCH v2 -tip] x86/percpu: Use C for arch_raw_cpu_ptr()

On Fri, Oct 20, 2023 at 12:39 AM Linus Torvalds
<[email protected]> wrote:
>
> Unrelated question to the gcc people (well, related in the way that
> this discussion made me *test* this).

Perhaps you should report this in the gcc bugzilla and move the
discussion there. This thread already has more than 100 messages...

Thanks,
Uros.

2023-11-20 09:39:27

by Uros Bizjak

[permalink] [raw]
Subject: Use %a asm operand modifier to obtain %rip-relative addressing

On Thu, Oct 12, 2023 at 7:10 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, 12 Oct 2023 at 09:55, Uros Bizjak <[email protected]> wrote:
> >
> > An example:
>
> Oh, I'm convinced.
>
> The fix seems to be a simple one-liner, ie just
>
> - asm(__pcpu_op2_##size(op, __percpu_arg(P[var]), "%[val]") \
> + asm(__pcpu_op2_##size(op, __percpu_arg(a[var]), "%[val]") \
>
> and it turns out that we have other places where I think we could use that '%a',
>
> For example, we have things like this:
>
> asm ("lea sme_cmdline_arg(%%rip), %0"
> : "=r" (cmdline_arg)
> : "p" (sme_cmdline_arg));
>
> and I think the only reason we do that ridiculous asm is that the code
> in question really does want that (%rip) encoding. It sounds like this
> could just do
>
> asm ("lea %a1, %0"
> : "=r" (cmdline_arg)
> : "p" (sme_cmdline_arg));
>
> instead. Once again, I claim ignorance of the operand modifiers as the
> reason for these kinds of things.

I have looked a bit at the above code. From the compiler PoV, the
above can be written as:

asm ("lea %1, %0"
: "=r" (cmdline_arg)
: "m" (sme_cmdline_arg));

and it will always result in a %rip-relative asm address:

#APP
# 585 "arch/x86/mm/mem_encrypt_identity.c" 1
lea sme_cmdline_arg(%rip), %rsi
# 0 "" 2
#NO_APP

Uros.