2020-04-29 09:50:56

by Steven Rostedt

Subject: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

From: Steven Rostedt (VMware) <[email protected]>

Tzvetomir was adding a feature to trace-cmd that would allow the user
to specify filtering on process IDs within a tracing instance (or
buffer). When he added this feature and tested it on tracing PIDs 1 and
2, it caused his kernel to hang.

He sent me his code and I was able to reproduce the hang as well. I
bisected it down to this commit 763802b53a42 ("x86/mm: split
vmalloc_sync_all()"). It was 100% reproducible. With the commit it
would hang, and reverting the commit, it would work.

Adding a bunch of printk()s, I found where it locked up. It was after
the recording was finished, and a write of "0" to
tracefs/instance/foo/events/enable. And in the code, it was:

(you may skip to the end of the chain)

system_enable_write() {
  __ftrace_set_clr_event() {
    __ftrace_set_clr_event_nolock() {
      ftrace_event_enable_disable() {
        __ftrace_event_enable_disable() {
          call->class->reg() <trace_event_reg()> {
            tracepoint_probe_unregister() {
              tracepoint_remove_func() {
                static_key_slow_dec() {
                  __static_key_slow_dec() {

<continued>

__static_key_slow_dec_cpuslocked() {
  jump_label_update() {
    __jump_label_update() {
      arch_jump_label_transform() {
        jump_label_transform() {
          __jump_label_transform() {
            text_poke_bp() {
              text_poke_bp_batch() {
                text_poke() {
                  __text_poke() {

<continued> (This is the part you want to see)

use_temporary_mm() {
  switch_mm_irqs_off() {
    load_new_mm_cr3() {
      write_cr3() <<--- Lock up!

The really strange part is that this only crashes if we filter the PIDs
on an instance (and via trace-cmd, which does some initial clean-ups).
It works fine when doing it from the top level tracing buffer, or when
doing this manually. I'm not sure how vmalloc gets involved with this.
Anyway, I tried the following patch, and it fixes the lockup for both
myself and Tzvetomir. But since I'm not 100% sure what broke, I'm
sending this out as an RFC.

Fixes: 763802b53a42 ("x86/mm: split vmalloc_sync_all()")
Reported-by: Tzvetomir Stoyanov (VMware) <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
---
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 7867dfb3963e..015dd30a2260 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -802,6 +802,8 @@ static void *__text_poke(void *addr, const void *opcode, size_t len)
 	 */
 	BUG_ON(!after_bootmem);
 
+	vmalloc_sync_mappings();
+
 	if (!core_kernel_text((unsigned long)addr)) {
 		pages[0] = vmalloc_to_page(addr);
 		if (cross_page_boundary)


2020-04-29 11:04:42

by Joerg Roedel

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

Hi Steven,

On Wed, Apr 29, 2020 at 05:48:57AM -0400, Steven Rostedt wrote:
> From: Steven Rostedt (VMware) <[email protected]>
>
> Tzvetomir was adding a feature to trace-cmd that would allow the user
> to specify filtering on process IDs within a tracing instance (or
> buffer). When he added this feature and tested it on tracing PIDs 1 and
> 2, it caused his kernel to hang.
>
> He sent me his code and I was able to reproduce the hang as well. I
> bisected it down to this commit 763802b53a42 ("x86/mm: split
> vmalloc_sync_all()"). It was 100% reproducible. With the commit it
> would hang, and reverting the commit, it would work.
>
> Adding a bunch of printk()s, I found where it locked up. It was after
> the recording was finished, and a write of "0" to
> tracefs/instance/foo/events/enable. And in the code, it was:
>
> (you may skip to the end of the chain)
>
> system_enable_write() {
>   __ftrace_set_clr_event() {
>     __ftrace_set_clr_event_nolock() {
>       ftrace_event_enable_disable() {
>         __ftrace_event_enable_disable() {
>           call->class->reg() <trace_event_reg()> {
>             tracepoint_probe_unregister() {
>               tracepoint_remove_func() {
>                 static_key_slow_dec() {
>                   __static_key_slow_dec() {
>
> <continued>
>
> __static_key_slow_dec_cpuslocked() {
>   jump_label_update() {
>     __jump_label_update() {
>       arch_jump_label_transform() {
>         jump_label_transform() {
>           __jump_label_transform() {
>             text_poke_bp() {
>               text_poke_bp_batch() {
>                 text_poke() {
>                   __text_poke() {
>
> <continued> (This is the part you want to see)
>
> use_temporary_mm() {
>   switch_mm_irqs_off() {
>     load_new_mm_cr3() {
>       write_cr3() <<--- Lock up!

I don't see how it could lock up in write_cr3(), at least on bare-metal.
What is the environment this happens in: 32 or 64 bit, in a VM or on
bare-metal?

I think it is more likely that your lockup is actually a page-fault
loop, where the #PF handler does not map the faulting address correctly.

But I have to look closer into how text_poke() works before I can say
more.

Btw, in case it happens on x86-64, does it also happen without
vmalloc-stacks?

Regards,

Joerg

2020-04-29 12:30:44

by Steven Rostedt

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

On Wed, 29 Apr 2020 12:59:41 +0200
Joerg Roedel <[email protected]> wrote:

> > use_temporary_mm() {
> >   switch_mm_irqs_off() {
> >     load_new_mm_cr3() {
> >       write_cr3() <<--- Lock up!
>
> I don't see how it could lock up in write_cr3(), at least on bare-metal.
> What is the environment this happens in: 32 or 64 bit, in a VM or on
> bare-metal?

64 bit bare-metal. In fact, it wasn't reproducible on a VM (according
to Tzvetomir, who was just using a Fedora kernel). I only tried it on
bare-metal.

>
> I think it is more likely that your lockup is actually a page-fault
> loop, where the #PF handler does not map the faulting address correctly.

Sounds reasonable.

>
> But I have to look closer into how text_poke() works before I can say
> more.
>
> Btw, in case it happens on x86-64, does it also happen without
> vmalloc-stacks?

Just tried it out with !CONFIG_VMAP_STACK and it still locks up :-/

-- Steve

2020-04-29 14:09:42

by Steven Rostedt

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

On Wed, 29 Apr 2020 08:28:54 -0400
Steven Rostedt <[email protected]> wrote:

> On Wed, 29 Apr 2020 12:59:41 +0200
> Joerg Roedel <[email protected]> wrote:
>
> > > use_temporary_mm() {
> > >   switch_mm_irqs_off() {
> > >     load_new_mm_cr3() {
> > >       write_cr3() <<--- Lock up!
> >
> > I don't see how it could lock up in write_cr3(), at least on bare-metal.
> > What is the environment this happens in: 32 or 64 bit, in a VM or on
> > bare-metal?
>
> 64 bit bare-metal. In fact, it wasn't reproducible on a VM (according
> to Tzvetomir, who was just using a Fedora kernel). I only tried it on
> bare-metal.
>
> >
> > I think it is more likely that your lockup is actually a page-fault
> > loop, where the #PF handler does not map the faulting address correctly.
>
> Sounds reasonable.
>
> >
> > But I have to look closer into how text_poke() works before I can say
> > more.
> >
> > Btw, in case it happens on x86-64, does it also happen without
> > vmalloc-stacks?
>
> Just tried it out with !CONFIG_VMAP_STACK and it still locks up :-/

Talking with Mathieu about this on IRC, he pointed out that my code does
have a vzalloc() that is called:

in trace_pid_write()

pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3);

This is done when -P1,2 is on the trace-cmd command line.

-- Steve

2020-04-29 14:12:01

by Joerg Roedel

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

On Wed, Apr 29, 2020 at 10:07:31AM -0400, Steven Rostedt wrote:
> Talking with Mathieu about this on IRC, he pointed out that my code does
> have a vzalloc() that is called:
>
> in trace_pid_write()
>
> pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3);
>
> This is done when -P1,2 is on the trace-cmd command line.

Yeah, I was guessing something like this: init_mm has a mapping which
poking_mm has not. I am currently trying to reproduce this on one of my
machines.

Regards,

Joerg

2020-04-29 14:36:29

by Steven Rostedt

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

On Wed, 29 Apr 2020 16:10:04 +0200
Joerg Roedel <[email protected]> wrote:

> On Wed, Apr 29, 2020 at 10:07:31AM -0400, Steven Rostedt wrote:
> > Talking with Mathieu about this on IRC, he pointed out that my code does
> > have a vzalloc() that is called:
> >
> > in trace_pid_write()
> >
> > pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3);
> >
> > This is done when -P1,2 is on the trace-cmd command line.
>
> Yeah, I was guessing something like this: init_mm has a mapping which
> poking_mm has not. I am currently trying to reproduce this on one of my
> machines.

Note, in use_temporary_mm(poking_mm), poking_mm only contains the page that
needs to be updated. But it will get loaded into
per_cpu(cpu_tlbstate.loaded_mm), which appears to be the one that's not
updated.
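
For reference, use_temporary_mm() in arch/x86/kernel/alternative.c looks
roughly like this (trimmed; a sketch, not the exact source):

typedef struct {
	struct mm_struct *mm;
} temp_mm_state_t;

static inline temp_mm_state_t use_temporary_mm(struct mm_struct *mm)
{
	temp_mm_state_t temp_state;

	lockdep_assert_irqs_disabled();

	/* Save the mm this CPU had loaded and switch over to poking_mm. */
	temp_state.mm = this_cpu_read(cpu_tlbstate.loaded_mm);
	switch_mm_irqs_off(NULL, mm, current);

	return temp_state;
}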

-- Steve

2020-04-29 15:49:06

by Peter Zijlstra

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

On Wed, Apr 29, 2020 at 10:32:16AM -0400, Steven Rostedt wrote:
> On Wed, 29 Apr 2020 16:10:04 +0200
> Joerg Roedel <[email protected]> wrote:
>
> > On Wed, Apr 29, 2020 at 10:07:31AM -0400, Steven Rostedt wrote:
> > > Talking with Mathieu about this on IRC, he pointed out that my code does
> > > have a vzalloc() that is called:
> > >
> > > in trace_pid_write()
> > >
> > > pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3);
> > >
> > > This is done when -P1,2 is on the trace-cmd command line.
> >
> > Yeah, I was guessing something like this: init_mm has a mapping which
> > poking_mm has not. I am currently trying to reproduce this on one of my
> > machines.
>
> Note, in use_temporary_mm(poking_mm), poking_mm only contains the page that

poking_mm should contain the full kernel map; the userspace part is 2
pages at a random location that alias two kernel pages.

So a vmalloc fault _should_ work.

> needs to be updated. But it will get loaded into
> per_cpu(cpu_tlbstate.loaded_mm), which appears to be the one that's not
> updated.
>
> -- Steve

2020-04-29 16:19:57

by Joerg Roedel

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

On Wed, Apr 29, 2020 at 10:07:31AM -0400, Steven Rostedt wrote:
> Talking with Mathieu about this on IRC, he pointed out that my code does
> have a vzalloc() that is called:
>
> in trace_pid_write()
>
> pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3);
>
> This is done when -P1,2 is on the trace-cmd command line.

And that buffer is written to at any function entry?

2020-04-29 16:23:27

by Joerg Roedel

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

On Wed, Apr 29, 2020 at 06:17:47PM +0200, Joerg Roedel wrote:
> On Wed, Apr 29, 2020 at 10:07:31AM -0400, Steven Rostedt wrote:
> > Talking with Mathieu about this on IRC, he pointed out that my code does
> > have a vzalloc() that is called:
> >
> > in trace_pid_write()
> >
> > pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3);
> >
> > This is done when -P1,2 is on the trace-cmd command line.
>
> And that buffer is written to at any function entry?

What I meant to say is: is it possible that the page-fault handler does
not complete because, at its beginning, it calls into trace code and
faults again on the same address?

Regards,

Joerg

2020-04-29 16:54:38

by Steven Rostedt

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

On Wed, 29 Apr 2020 18:20:26 +0200
Joerg Roedel <[email protected]> wrote:

> On Wed, Apr 29, 2020 at 06:17:47PM +0200, Joerg Roedel wrote:
> > On Wed, Apr 29, 2020 at 10:07:31AM -0400, Steven Rostedt wrote:
> > > Talking with Mathieu about this on IRC, he pointed out that my code does
> > > have a vzalloc() that is called:
> > >
> > > in trace_pid_write()
> > >
> > > pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3);
> > >
> > > This is done when -P1,2 is on the trace-cmd command line.
> >
> > And that buffer is written to at any function entry?
>
> What I meant to say is: is it possible that the page-fault handler does
> not complete because, at its beginning, it calls into trace code and
> faults again on the same address?
>

It should be read only at sched_switch.

Basically, it's a big bitmask, where each bit represents a possible process
id (can be 2 gigs if we allow all positive ints!).

Then, it is only written when setting it up. Bits 1 and 2 are set here
(-P1,2). At context switch, next->pid is checked against this bitmask, and
if it is set, it means we should allow this process to be traced.

This mask should only be accessed at sched_switch time, not at other times.
And it may read any possible page in that mask depending on the process id
of the next task to be scheduled in.
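
Roughly, the sched_switch-time check looks like this (a simplified sketch
of trace_find_filtered_pid() in kernel/trace/trace.c; locking and RCU
details elided):

bool trace_find_filtered_pid(struct trace_pid_list *filtered_pids,
			     pid_t search_pid)
{
	/* Pids beyond the bitmask's range are ignored. */
	if (search_pid >= filtered_pids->pid_max)
		return false;

	/* This read of the vzalloc()'d bitmask is what can fault. */
	return test_bit(search_pid, filtered_pids->pids);
}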

-- Steve

2020-04-29 17:32:44

by Mathieu Desnoyers

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

----- On Apr 29, 2020, at 12:52 PM, rostedt [email protected] wrote:

> On Wed, 29 Apr 2020 18:20:26 +0200
> Joerg Roedel <[email protected]> wrote:
>
>> On Wed, Apr 29, 2020 at 06:17:47PM +0200, Joerg Roedel wrote:
>> > On Wed, Apr 29, 2020 at 10:07:31AM -0400, Steven Rostedt wrote:
>> > > Talking with Mathieu about this on IRC, he pointed out that my code does
>> > > have a vzalloc() that is called:
>> > >
>> > > in trace_pid_write()
>> > >
>> > > pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3);
>> > >
>> > > This is done when -P1,2 is on the trace-cmd command line.
>> >
>> > And that buffer is written to at any function entry?
>>
>> What I meant to say is: is it possible that the page-fault handler does
>> not complete because, at its beginning, it calls into trace code and
>> faults again on the same address?
>>
>
> It should be read only at sched_switch.
>
> Basically, it's a big bitmask, where each bit represents a possible process
> id (can be 2 gigs if we allow all positive ints!).

I think you mean 2 giga-bit, for 256MB worth of memory, right ?

And AFAIU the PID_MAX_LIMIT is at a maximum of 4 million PIDs in
include/linux/threads.h, which means 512KB worth of memory for a
bitmask.

> Then, it is only written when setting it up. Bits 1 and 2 are set here
> (-P1,2). At context switch, next->pid is checked against this bitmask, and
> if it is set, it means we should allow this process to be traced.
>
> This mask should only be accessed at sched_switch time, not at other times.
> And it may read any possible page in that mask depending on the process id
> of the next task to be scheduled in.

Not sure how relevant it is, but I notice that it is also touched from IPI
context, see:

on_each_cpu(ignore_task_cpu, tr, 1);

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-04-29 18:54:27

by Peter Zijlstra

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

> > Basically, it's a big bitmask, where each bit represents a possible process
> > id (can be 2 gigs if we allow all positive ints!).
>
> I think you mean 2 giga-bit, for 256MB worth of memory, right ?
>
> And AFAIU the PID_MAX_LIMIT is at a maximum of 4 million PIDs in
> include/linux/threads.h, which means 512KB worth of memory for a
> bitmask.

PID space is limited to 30 bits per FUTEX_TID_MASK, still, stupid large
:-)

2020-04-30 14:15:31

by Joerg Roedel

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

Hi,

On Wed, Apr 29, 2020 at 10:07:31AM -0400, Steven Rostedt wrote:
> Talking with Mathieu about this on IRC, he pointed out that my code does
> have a vzalloc() that is called:
>
> in trace_pid_write()
>
> pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3);
>
> This is done when -P1,2 is on the trace-cmd command line.

Okay, tracked it down, some instrumentation in the page-fault and
double-fault handler gave me the stack-traces. Here is what happens:

As already pointed out, it all happens because of page-faults on the
vzalloc'ed pid bitmap. It starts with this stack-trace:

RIP: 0010:trace_event_ignore_this_pid+0x23/0x30
Code: e9 c2 4b 6b 00 cc cc 48 8b 57 28 48 8b 8a b8 00 00 00 48 8b 82 c0 00 00 00 48 85 c0 74 11 48 8b 42 28 65 48 03 05 5d 9c e6 7e <0f> b6 40 7c c3 48 85 c9 75 ea f3 c3 90 48 8b 4f 70 48 83 02 01 48
RSP: 0018:ffffc90000673bd8 EFLAGS: 00010082
RAX: ffffe8ffffd8c870 RBX: 0000000000000203 RCX: ffff88810734ca90
RDX: ffff888451578000 RSI: ffffffff820f3d2a RDI: ffff888453594d68
RBP: ffff888453594d68 R08: 000000000001845e R09: ffffffff81114ba0
R10: 0000000000000000 R11: 000000000000000e R12: 4000000000000000
R13: ffffffff820f3d2a R14: 000000000001845e R15: 4000000000000002
? trace_event_raw_event_rcu_fqs+0xa0/0xa0
trace_event_raw_event_rcu_dyntick+0x89/0xa0
? trace_event_raw_event_rcu_dyntick+0x89/0xa0
? trace_event_raw_event_rcu_dyntick+0x89/0xa0
? insn_get_prefixes.part.2+0x174/0x2d0
rcu_irq_enter+0xf0/0x1d0
rcu_irq_enter_irqson+0x21/0x50
switch_mm_irqs_off+0x43c/0x570
? do_one_initcall+0x51/0x210
__text_poke+0x1a9/0x470
text_poke_bp_batch+0x73/0x180
text_poke_flush+0x43/0x50
arch_jump_label_transform_apply+0x16/0x30
__static_key_slow_dec_cpuslocked+0x42/0x50
static_key_slow_dec+0x1f/0x50
tracepoint_probe_unregister+0x1e2/0x220
trace_event_reg+0x6a/0x80
__ftrace_event_enable_disable+0x1ca/0x240
__ftrace_set_clr_event_nolock+0xe1/0x140
__ftrace_set_clr_event+0x3d/0x60
system_enable_write+0x76/0xa0
vfs_write+0xad/0x1a0
? rcu_irq_exit+0xb8/0x170
ksys_write+0x48/0xb0
do_syscall_64+0x60/0x200
entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x7f875b0ecdb0
Code: Bad RIP value.
RSP: 002b:00007ffc746be918 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f875b0ecdb0
RDX: 0000000000000001 RSI: 00007ffc746be93f RDI: 0000000000000004
RBP: 00007ffc746be9a0 R08: 636172742f6c6500 R09: 0000000002242ee0
R10: 6f662f7365636e61 R11: 0000000000000246 R12: 000000000040a4b0
R13: 00007ffc746bebd0 R14: 0000000000000000 R15: 0000000000000000

I haven't figured out how rcu_irq_enter() calls down into
trace_event_raw_event_rcu_dyntick() and further into
trace_event_ignore_this_pid(), but the stacktrace shows it does.

So trace_event_ignore_this_pid() faults on the vzalloc()'ed memory, calling
into the page-fault handler. What happens there is:

RIP: 0010:trace_event_ignore_this_pid+0x23/0x30
Code: e9 c2 4b 6b 00 cc cc 48 8b 57 28 48 8b 8a b8 00 00 00 48 8b 82 c0 00 00 00 48 85 c0 74 11 48 8b 42 28 65 48 03 05 5d 9c e6 7e <0f> b6 40 7c c3 48 85 c9 75 ea f3 c3 90 48 8b 4f 70 48 83 02 01 48
RSP: 0018:ffffc90000673a50 EFLAGS: 00010082
RAX: ffffe8ffffd8c870 RBX: 0000000000000203 RCX: ffff88810734ca90
RDX: ffff888451578000 RSI: ffffe8ffffd8c8ec RDI: ffff88844fd98478
RBP: ffff88844fd98478 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffe8ffffd8c8ec
R13: 0000000000000000 R14: ffffc90000673b28 R15: 0000000000000000
trace_event_raw_event_x86_exceptions+0x87/0xa0
? trace_event_buffer_lock_reserve+0x6e/0x110
do_page_fault+0x45e/0x630
? trace_hardirqs_off_thunk+0x1a/0x37
page_fault+0x43/0x50

The page-fault handler calls a tracing function which again ends up in
trace_event_ignore_this_pid(), where it faults again. From here on the CPU is in
a page-fault loop, which continues until the stack overflows (with
CONFIG_VMAP_STACK).

Then there is no mapped stack anymore, so the page-fault results in a
double-fault, which uses an IST stack. The double-fault handler does
ist_enter(), which calls into rcu_nmi_enter(), which also has trace-events down
its call-path. I have no stack-trace for this, but what likely happens
now is that it page-faults again while on the IST stack and the
page-fault loops until the #DF IST stack overflows. Then the next #DF
happens and the stack pointer is reset to the top of the #DF IST stack,
starting the loop over again. This loops forever, causing the hang.
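
In call-chain form (reconstructed from the stack-traces above), the loop
is roughly:

page_fault
  do_page_fault()
    trace_event_raw_event_x86_exceptions()
      trace_event_ignore_this_pid() <-- faults again on the same
                                        unsynced vmalloc address
        page_fault
          do_page_fault()
            ... (repeats until the stack overflows)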

Regards,

Joerg

2020-04-30 14:55:54

by Joerg Roedel

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

On Thu, Apr 30, 2020 at 04:11:20PM +0200, Joerg Roedel wrote:
> The page-fault handler calls a tracing function which again ends up in
> trace_event_ignore_this_pid(), where it faults again. From here on the CPU is in
> a page-fault loop, which continues until the stack overflows (with
> CONFIG_VMAP_STACK).

Did some more testing to find out what this issue has to do with

763802b53a42 x86/mm: split vmalloc_sync_all()

The above commit removes a call to vmalloc_sync_all() from the vmalloc
unmapping path, because that call caused severe performance regressions
on some workloads and was not needed on x86-64 anyway.

But that call caused vmalloc_sync_all() to be called regularly on
x86-64 machines, so that all page-tables were more likely to be in sync.

The call was introduced by commit

3f8fd02b1bf1 mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()

to fix a correctness issue on x86-32 PAE systems, which also need
unmappings of large pages in the vmalloc area to be synchronized.

This additional call to vmalloc_sync_all() did hide the problem. I
verified it by reverting both of the above commits on v5.7-rc3 and
testing on that kernel. The problem is reproducible there too; the box
hangs hard.

So the underlying problem is that vmalloc()'ed tracing buffers are used
to trace the page-fault handler itself, so the handler has no chance of
lazily faulting the buffer address into poking_mm and maybe other PGDs.

The right fix is to call vmalloc_sync_mappings() right after allocating
tracing or perf buffers via v[zm]alloc().
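
A minimal sketch of that approach, using the trace_pid_write() allocation
from this thread as the example (assuming the sync is done before the pid
list is published via rcu_assign_pointer()):

	pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3);
	if (!pid_list->pids)
		return -ENOMEM;

	/*
	 * Make sure the new vmalloc range is present in all PGDs,
	 * including poking_mm, before instrumentation can fault on it.
	 */
	vmalloc_sync_mappings();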

Regards,

Joerg

2020-04-30 15:22:26

by Mathieu Desnoyers

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

----- On Apr 30, 2020, at 10:50 AM, Joerg Roedel [email protected] wrote:

> On Thu, Apr 30, 2020 at 04:11:20PM +0200, Joerg Roedel wrote:
>> The page-fault handler calls a tracing function which again ends up in
>> trace_event_ignore_this_pid(), where it faults again. From here on the CPU is in
>> a page-fault loop, which continues until the stack overflows (with
>> CONFIG_VMAP_STACK).
>
> Did some more testing to find out what this issue has to do with
>
> 763802b53a42 x86/mm: split vmalloc_sync_all()
>
> The above commit removes a call to vmalloc_sync_all() from the vmalloc
> unmapping path, because that call caused severe performance regressions
> on some workloads and was not needed on x86-64 anyway.
>
> But that call caused vmalloc_sync_all() to be called regularly on
> x86-64 machines, so that all page-tables were more likely to be in sync.
>
> The call was introduced by commit
>
> 3f8fd02b1bf1 mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()
>
> to fix a correctness issue on x86-32 PAE systems, which also need
> unmappings of large pages in the vmalloc area to be synchronized.
>
> This additional call to vmalloc_sync_all() did hide the problem. I
> verified it by reverting both of the above commits on v5.7-rc3 and
> testing on that kernel. The problem is reproducible there too; the box
> hangs hard.
>
> So the underlying problem is that vmalloc()'ed tracing buffers are used
> to trace the page-fault handler itself, so the handler has no chance of
> lazily faulting the buffer address into poking_mm and maybe other PGDs.
>
> The right fix is to call vmalloc_sync_mappings() right after allocating
> tracing or perf buffers via v[zm]alloc().

Either right after allocation, or right before making the vmalloc'd data
structure visible to the instrumentation. In the case of the pid filter,
that would be the rcu_assign_pointer() which publishes the new pid filter
table.

As long as vmalloc_sync_mappings() is performed somewhere *between* allocation
and publishing the pointer for instrumentation, it's fine.

I'll let Steven decide on which approach works best for him.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-04-30 15:25:19

by Mathieu Desnoyers

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

----- On Apr 30, 2020, at 10:11 AM, Joerg Roedel [email protected] wrote:

> Hi,
>
> On Wed, Apr 29, 2020 at 10:07:31AM -0400, Steven Rostedt wrote:
>> Talking with Mathieu about this on IRC, he pointed out that my code does
>> have a vzalloc() that is called:
>>
>> in trace_pid_write()
>>
>> pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3);
>>
>> This is done when -P1,2 is on the trace-cmd command line.
>
> Okay, tracked it down, some instrumentation in the page-fault and
> double-fault handler gave me the stack-traces. Here is what happens:
>
> As already pointed out, it all happens because of page-faults on the
> vzalloc'ed pid bitmap. It starts with this stack-trace:
>
> RIP: 0010:trace_event_ignore_this_pid+0x23/0x30
> Code: e9 c2 4b 6b 00 cc cc 48 8b 57 28 48 8b 8a b8 00 00 00 48 8b 82 c0 00 00 00 48 85 c0 74 11 48 8b 42 28 65 48 03 05 5d 9c e6 7e <0f> b6 40 7c c3 48 85 c9 75 ea f3 c3 90 48 8b 4f 70 48 83 02 01 48
> RSP: 0018:ffffc90000673bd8 EFLAGS: 00010082
> RAX: ffffe8ffffd8c870 RBX: 0000000000000203 RCX: ffff88810734ca90
> RDX: ffff888451578000 RSI: ffffffff820f3d2a RDI: ffff888453594d68
> RBP: ffff888453594d68 R08: 000000000001845e R09: ffffffff81114ba0
> R10: 0000000000000000 R11: 000000000000000e R12: 4000000000000000
> R13: ffffffff820f3d2a R14: 000000000001845e R15: 4000000000000002
> ? trace_event_raw_event_rcu_fqs+0xa0/0xa0
> trace_event_raw_event_rcu_dyntick+0x89/0xa0
> ? trace_event_raw_event_rcu_dyntick+0x89/0xa0
> ? trace_event_raw_event_rcu_dyntick+0x89/0xa0
> ? insn_get_prefixes.part.2+0x174/0x2d0
> rcu_irq_enter+0xf0/0x1d0
> rcu_irq_enter_irqson+0x21/0x50
> switch_mm_irqs_off+0x43c/0x570
> ? do_one_initcall+0x51/0x210
> __text_poke+0x1a9/0x470
> text_poke_bp_batch+0x73/0x180
> text_poke_flush+0x43/0x50
> arch_jump_label_transform_apply+0x16/0x30
> __static_key_slow_dec_cpuslocked+0x42/0x50
> static_key_slow_dec+0x1f/0x50
> tracepoint_probe_unregister+0x1e2/0x220
> trace_event_reg+0x6a/0x80
> __ftrace_event_enable_disable+0x1ca/0x240
> __ftrace_set_clr_event_nolock+0xe1/0x140
> __ftrace_set_clr_event+0x3d/0x60
> system_enable_write+0x76/0xa0
> vfs_write+0xad/0x1a0
> ? rcu_irq_exit+0xb8/0x170
> ksys_write+0x48/0xb0
> do_syscall_64+0x60/0x200
> entry_SYSCALL_64_after_hwframe+0x49/0xb3
> RIP: 0033:0x7f875b0ecdb0
> Code: Bad RIP value.
> RSP: 002b:00007ffc746be918 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f875b0ecdb0
> RDX: 0000000000000001 RSI: 00007ffc746be93f RDI: 0000000000000004
> RBP: 00007ffc746be9a0 R08: 636172742f6c6500 R09: 0000000002242ee0
> R10: 6f662f7365636e61 R11: 0000000000000246 R12: 000000000040a4b0
> R13: 00007ffc746bebd0 R14: 0000000000000000 R15: 0000000000000000
>
> I haven't figured out how rcu_irq_enter() calls down into
> trace_event_raw_event_rcu_dyntick() and further into
> trace_event_ignore_this_pid(), but the stacktrace shows it does.
>
> So trace_event_ignore_this_pid() faults on the vzalloc()'ed memory, calling
> into the page-fault handler. What happens there is:
>
> RIP: 0010:trace_event_ignore_this_pid+0x23/0x30
> Code: e9 c2 4b 6b 00 cc cc 48 8b 57 28 48 8b 8a b8 00 00 00 48 8b 82 c0 00 00 00 48 85 c0 74 11 48 8b 42 28 65 48 03 05 5d 9c e6 7e <0f> b6 40 7c c3 48 85 c9 75 ea f3 c3 90 48 8b 4f 70 48 83 02 01 48
> RSP: 0018:ffffc90000673a50 EFLAGS: 00010082
> RAX: ffffe8ffffd8c870 RBX: 0000000000000203 RCX: ffff88810734ca90
> RDX: ffff888451578000 RSI: ffffe8ffffd8c8ec RDI: ffff88844fd98478
> RBP: ffff88844fd98478 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: ffffe8ffffd8c8ec
> R13: 0000000000000000 R14: ffffc90000673b28 R15: 0000000000000000
> trace_event_raw_event_x86_exceptions+0x87/0xa0
> ? trace_event_buffer_lock_reserve+0x6e/0x110
> do_page_fault+0x45e/0x630
> ? trace_hardirqs_off_thunk+0x1a/0x37
> page_fault+0x43/0x50
>
> The page-fault handler calls a tracing function which again ends up in
> trace_event_ignore_this_pid(), where it faults again. From here on the CPU is in
> a page-fault loop, which continues until the stack overflows (with
> CONFIG_VMAP_STACK).
>
> Then there is no mapped stack anymore, so the page-fault results in a
> double-fault, which uses an IST stack. The double-fault handler does
> ist_enter(), which calls into rcu_nmi_enter(), which also has trace-events down
> its call-path. I have no stack-trace for this, but what likely happens
> now is that it page-faults again while on the IST stack and the
> page-fault loops until the #DF IST stack overflows. Then the next #DF
> happens and the stack pointer is reset to the top of the #DF IST stack,
> starting the loop over again. This loops forever, causing the hang.

I suspect it's called by:

include/linux/trace_event.h:trace_trigger_soft_disabled():

static inline bool
trace_trigger_soft_disabled(struct trace_event_file *file)
{
	unsigned long eflags = file->flags;

	if (!(eflags & EVENT_FILE_FL_TRIGGER_COND)) {
		if (eflags & EVENT_FILE_FL_TRIGGER_MODE)
			event_triggers_call(file, NULL, NULL);
		if (eflags & EVENT_FILE_FL_SOFT_DISABLED)
			return true;
		if (eflags & EVENT_FILE_FL_PID_FILTER)
			return trace_event_ignore_this_pid(file);
	}
	return false;
}

Thanks,

Mathieu

>
> Regards,
>
> Joerg

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-04-30 16:14:42

by Steven Rostedt

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

On Thu, 30 Apr 2020 11:23:29 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:

> I suspect it's called by:
>
> include/linux/trace_event.h:trace_trigger_soft_disabled():
>
> static inline bool
> trace_trigger_soft_disabled(struct trace_event_file *file)
> {
> 	unsigned long eflags = file->flags;
>
> 	if (!(eflags & EVENT_FILE_FL_TRIGGER_COND)) {
> 		if (eflags & EVENT_FILE_FL_TRIGGER_MODE)
> 			event_triggers_call(file, NULL, NULL);
> 		if (eflags & EVENT_FILE_FL_SOFT_DISABLED)
> 			return true;
> 		if (eflags & EVENT_FILE_FL_PID_FILTER)
> 			return trace_event_ignore_this_pid(file);
> 	}
> 	return false;
> }

This code is only used for the "enable/disable event" triggers, which isn't
the case here, so this should never be called.

-- Steve

2020-04-30 16:15:35

by Steven Rostedt

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

On Thu, 30 Apr 2020 16:11:21 +0200
Joerg Roedel <[email protected]> wrote:

> Hi,
>
> On Wed, Apr 29, 2020 at 10:07:31AM -0400, Steven Rostedt wrote:
> > Talking with Mathieu about this on IRC, he pointed out that my code does
> > have a vzalloc() that is called:
> >
> > in trace_pid_write()
> >
> > pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3);
> >
> > This is done when -P1,2 is on the trace-cmd command line.
>
> Okay, tracked it down, some instrumentation in the page-fault and
> double-fault handler gave me the stack-traces. Here is what happens:
>
> As already pointed out, it all happens because of page-faults on the
> vzalloc'ed pid bitmap. It starts with this stack-trace:
>
> RIP: 0010:trace_event_ignore_this_pid+0x23/0x30

Interesting. Because that function is this:

bool trace_event_ignore_this_pid(struct trace_event_file *trace_file)
{
	struct trace_array *tr = trace_file->tr;
	struct trace_array_cpu *data;
	struct trace_pid_list *no_pid_list;
	struct trace_pid_list *pid_list;

	pid_list = rcu_dereference_raw(tr->filtered_pids);
	no_pid_list = rcu_dereference_raw(tr->filtered_no_pids);

	if (!pid_list && !no_pid_list)
		return false;

	data = this_cpu_ptr(tr->array_buffer.data);

	return data->ignore_pid;
}

Where it only checks whether the pid masks exist. That is, it looks to see
if there are pointers to them; it doesn't actually touch the vmalloc'd
area. This check is to handle a race between allocating and deallocating
the buffers and setting the ignore_pid bit. The reading of these arrays is
done at sched_switch time, which sets or clears the ignore_pid field.

That said, this only happens on buffer instances (it does not trigger on
the top level instance, which uses the same code for the pid masks).

Could this possibly be the tr->array_buffer.data, which is allocated with:

allocate_trace_buffer() {
	[..]
	buf->data = alloc_percpu(struct trace_array_cpu);

That is, the bug isn't the vmalloc being a problem, but perhaps the per_cpu
allocation. This would explain why this crashes with the buffer instance
and not with the top level instance. If it were related to the pid masks,
then it would trigger for either (because they act the same in allocating
at time of use). But when an instance is made, the tr->array_buffer.data is
created. For the top level this happens at boot up, and the pages would
have been synced long ago. But for a newly created instance, this happens
just before it's used. This could possibly explain why it's not a problem
when doing it manually by hand, because the time between creating the
instance and the time to start and stop the tracing is long enough for
something to sync the page tables.

tl;dr: It's not an issue with the vmalloc, it's an issue with the per_cpu
allocations!


-- Steve

2020-04-30 16:18:35

by Steven Rostedt

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

On Thu, 30 Apr 2020 11:20:15 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:

> > The right fix is to call vmalloc_sync_mappings() right after allocating
> > tracing or perf buffers via v[zm]alloc().
>
> Either right after allocation, or right before making the vmalloc'd data
> structure visible to the instrumentation. In the case of the pid filter,
> that would be the rcu_assign_pointer() which publishes the new pid filter
> table.
>
> As long as vmalloc_sync_mappings() is performed somewhere *between* allocation
> and publishing the pointer for instrumentation, it's fine.
>
> I'll let Steven decide on which approach works best for him.

As stated in the other email, I don't see it having anything to do with
vmalloc, but with the per_cpu() allocation. I'll test this theory out by
not even allocating the pid masks and touching the per cpu data at every
event to see if it crashes.

-- Steve

2020-04-30 16:20:17

by Mathieu Desnoyers

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

----- On Apr 30, 2020, at 12:11 PM, rostedt [email protected] wrote:

> On Thu, 30 Apr 2020 16:11:21 +0200
> Joerg Roedel <[email protected]> wrote:
>
>> Hi,
>>
>> On Wed, Apr 29, 2020 at 10:07:31AM -0400, Steven Rostedt wrote:
>> > Talking with Mathieu about this on IRC, he pointed out that my code does
>> > have a vzalloc() that is called:
>> >
>> > in trace_pid_write()
>> >
>> > pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3);
>> >
>> > This is done when -P1,2 is on the trace-cmd command line.
>>
>> Okay, tracked it down, some instrumentation in the page-fault and
>> double-fault handler gave me the stack-traces. Here is what happens:
>>
>> As already pointed out, it all happens because of page-faults on the
>> vzalloc'ed pid bitmap. It starts with this stack-trace:
>>
>> RIP: 0010:trace_event_ignore_this_pid+0x23/0x30
>
> Interesting. Because that function is this:
>
> bool trace_event_ignore_this_pid(struct trace_event_file *trace_file)
> {
> 	struct trace_array *tr = trace_file->tr;
> 	struct trace_array_cpu *data;
> 	struct trace_pid_list *no_pid_list;
> 	struct trace_pid_list *pid_list;
>
> 	pid_list = rcu_dereference_raw(tr->filtered_pids);
> 	no_pid_list = rcu_dereference_raw(tr->filtered_no_pids);
>
> 	if (!pid_list && !no_pid_list)
> 		return false;
>
> 	data = this_cpu_ptr(tr->array_buffer.data);
>
> 	return data->ignore_pid;
> }
>
> Where it only checks whether the pid masks exist. That is, it looks to see
> if there are pointers to them; it doesn't actually touch the vmalloc'd
> area. This check is to handle a race between allocating and deallocating
> the buffers and setting the ignore_pid bit. The reading of these arrays is
> done at sched_switch time, which sets or clears the ignore_pid field.
>
> That said, this only happens on buffer instances (it does not trigger on
> the top level instance, which uses the same code for the pid masks).
>
> Could this possibly be the tr->array_buffer.data, which is allocated with:
>
> allocate_trace_buffer() {
> 	[..]
> 	buf->data = alloc_percpu(struct trace_array_cpu);
>
> That is, the bug isn't the vmalloc being a problem, but perhaps the per_cpu
> allocation. This would explain why this crashes with the buffer instance
> and not with the top level instance. If it were related to the pid masks,
> then it would trigger for either (because they act the same in allocating
> at time of use). But when an instance is made, the tr->array_buffer.data is
> created. For the top level this happens at boot up, and the pages would
> have been synced long ago. But for a newly created instance, this happens
> just before it's used. This could possibly explain why it's not a problem
> when doing it manually by hand, because the time between creating the
> instance and the time to start and stop the tracing is long enough for
> something to sync the page tables.
>
> tl;dr: It's not an issue with the vmalloc, it's an issue with the per_cpu
> allocations!

Did I mention that alloc_percpu uses:

static void *pcpu_mem_zalloc(size_t size, gfp_t gfp)
{
	if (WARN_ON_ONCE(!slab_is_available()))
		return NULL;

	if (size <= PAGE_SIZE)
		return kzalloc(size, gfp);
	else
		return __vmalloc(size, gfp | __GFP_ZERO, PAGE_KERNEL);
}

So yeah, it's vmalloc'd memory when size > PAGE_SIZE.

Thanks,

Mathieu




--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-04-30 16:22:36

by Mathieu Desnoyers

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

----- On Apr 30, 2020, at 12:16 PM, rostedt [email protected] wrote:

> On Thu, 30 Apr 2020 11:20:15 -0400 (EDT)
> Mathieu Desnoyers <[email protected]> wrote:
>
>> > The right fix is to call vmalloc_sync_mappings() right after allocating
>> > tracing or perf buffers via v[zm]alloc().
>>
>> Either right after allocation, or right before making the vmalloc'd data
>> structure visible to the instrumentation. In the case of the pid filter,
>> that would be the rcu_assign_pointer() which publishes the new pid filter
>> table.
>>
>> As long as vmalloc_sync_mappings() is performed somewhere *between* allocation
>> and publishing the pointer for instrumentation, it's fine.
>>
>> I'll let Steven decide on which approach works best for him.
>
> As stated in the other email, I don't see it having anything to do with
> vmalloc, but with the per_cpu() allocation. I'll test this theory out by
> not even allocating the pid masks and touching the per cpu data at every
> event to see if it crashes.

As pointed out in my other email, per-cpu allocation uses vmalloc when
size > PAGE_SIZE.

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-04-30 16:28:13

by Steven Rostedt

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

On Thu, 30 Apr 2020 12:16:20 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:

> ----- On Apr 30, 2020, at 12:11 PM, rostedt [email protected] wrote:
>
> > On Thu, 30 Apr 2020 16:11:21 +0200
> > Joerg Roedel <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> On Wed, Apr 29, 2020 at 10:07:31AM -0400, Steven Rostedt wrote:
> >> > Talking with Mathieu about this on IRC, he pointed out that my code does
> >> > have a vzalloc() that is called:
> >> >
> >> > in trace_pid_write()
> >> >
> >> > pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3);
> >> >
> >> > This is done when -P1,2 is on the trace-cmd command line.
> >>
> >> Okay, tracked it down, some instrumentation in the page-fault and
> >> double-fault handler gave me the stack-traces. Here is what happens:
> >>
> >> As already pointed out, it all happens because of page-faults on the
> >> vzalloc'ed pid bitmap. It starts with this stack-trace:
> >>
> >> RIP: 0010:trace_event_ignore_this_pid+0x23/0x30
> >
> > Interesting. Because that function is this:
> >
> > bool trace_event_ignore_this_pid(struct trace_event_file *trace_file)
> > {
> > 	struct trace_array *tr = trace_file->tr;
> > 	struct trace_array_cpu *data;
> > 	struct trace_pid_list *no_pid_list;
> > 	struct trace_pid_list *pid_list;
> >
> > 	pid_list = rcu_dereference_raw(tr->filtered_pids);
> > 	no_pid_list = rcu_dereference_raw(tr->filtered_no_pids);
> >
> > 	if (!pid_list && !no_pid_list)
> > 		return false;
> >
> > 	data = this_cpu_ptr(tr->array_buffer.data);
> >
> > 	return data->ignore_pid;
> > }
> >
> > Where it only checks whether the pid masks exist. That is, it looks to see
> > if there are pointers to them; it doesn't actually touch the vmalloc'd
> > area. This check is to handle a race between allocating and deallocating
> > the buffers and setting the ignore_pid bit. The reading of these arrays is
> > done at sched_switch time, which sets or clears the ignore_pid field.
> >
> > That said, this only happens on buffer instances (it does not trigger on
> > the top level instance, which uses the same code for the pid masks).
> >
> > Could this possibly be the tr->array_buffer.data, which is allocated with:
> >
> > allocate_trace_buffer() {
> > 	[..]
> > 	buf->data = alloc_percpu(struct trace_array_cpu);
> >
> > That is, the bug isn't the vmalloc being a problem, but perhaps the per_cpu
> > allocation. This would explain why this crashes with the buffer instance
> > and not with the top level instance. If it were related to the pid masks,
> > then it would trigger for either (because they act the same in allocating
> > at time of use). But when an instance is made, the tr->array_buffer.data is
> > created. For the top level this happens at boot up, and the pages would
> > have been synced long ago. But for a newly created instance, this happens
> > just before it's used. This could possibly explain why it's not a problem
> > when doing it manually by hand, because the time between creating the
> > instance and the time to start and stop the tracing is long enough for
> > something to sync the page tables.
> >
> > tl;dr: It's not an issue with the vmalloc, it's an issue with the per_cpu
> > allocations!
>
> Did I mention that alloc_percpu uses:
>
> static void *pcpu_mem_zalloc(size_t size, gfp_t gfp)
> {
> 	if (WARN_ON_ONCE(!slab_is_available()))
> 		return NULL;
>
> 	if (size <= PAGE_SIZE)
> 		return kzalloc(size, gfp);
> 	else
> 		return __vmalloc(size, gfp | __GFP_ZERO, PAGE_KERNEL);
> }
>
> So yeah, it's vmalloc'd memory when size > PAGE_SIZE.
>

I certainly hope that struct trace_array_cpu is not bigger than PAGE_SIZE!

-- Steve

2020-04-30 16:34:36

by Steven Rostedt

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

On Thu, 30 Apr 2020 12:18:22 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:

> ----- On Apr 30, 2020, at 12:16 PM, rostedt [email protected] wrote:
>
> > On Thu, 30 Apr 2020 11:20:15 -0400 (EDT)
> > Mathieu Desnoyers <[email protected]> wrote:
> >
> >> > The right fix is to call vmalloc_sync_mappings() right after allocating
> >> > tracing or perf buffers via v[zm]alloc().
> >>
> >> Either right after allocation, or right before making the vmalloc'd data
> >> structure visible to the instrumentation. In the case of the pid filter,
> >> that would be the rcu_assign_pointer() which publishes the new pid filter
> >> table.
> >>
> >> As long as vmalloc_sync_mappings() is performed somewhere *between* allocation
> >> and publishing the pointer for instrumentation, it's fine.
> >>
> >> I'll let Steven decide on which approach works best for him.
> >
> > As stated in the other email, I don't see it having anything to do with
> > vmalloc, but with the per_cpu() allocation. I'll test this theory out by
> > not even allocating the pid masks and touching the per cpu data at every
> > event to see if it crashes.
>
> As pointed out in my other email, per-cpu allocation uses vmalloc when
> size > PAGE_SIZE.

And as I replied:

buf->data = alloc_percpu(struct trace_array_cpu);

struct trace_array_cpu {
	atomic_t	disabled;
	void		*buffer_page;	/* ring buffer spare */

	unsigned long	entries;
	unsigned long	saved_latency;
	unsigned long	critical_start;
	unsigned long	critical_end;
	unsigned long	critical_sequence;
	unsigned long	nice;
	unsigned long	policy;
	unsigned long	rt_priority;
	unsigned long	skipped_entries;
	u64		preempt_timestamp;
	pid_t		pid;
	kuid_t		uid;
	char		comm[TASK_COMM_LEN];

	bool		ignore_pid;
#ifdef CONFIG_FUNCTION_TRACER
	bool		ftrace_ignore_pid;
#endif
};

That doesn't look bigger than PAGE_SIZE to me.
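
(Back-of-the-envelope: an atomic_t, a pointer, nine longs, a u64, two
32-bit IDs, a 16-byte comm[] and a couple of bools come to roughly 128
bytes on x86-64, nowhere near PAGE_SIZE.)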

-- Steve

2020-04-30 16:37:26

by Mathieu Desnoyers

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

----- On Apr 30, 2020, at 12:30 PM, rostedt [email protected] wrote:

> On Thu, 30 Apr 2020 12:18:22 -0400 (EDT)
> Mathieu Desnoyers <[email protected]> wrote:
>
>> ----- On Apr 30, 2020, at 12:16 PM, rostedt [email protected] wrote:
>>
>> > On Thu, 30 Apr 2020 11:20:15 -0400 (EDT)
>> > Mathieu Desnoyers <[email protected]> wrote:
>> >
>> >> > The right fix is to call vmalloc_sync_mappings() right after allocating
>> >> > tracing or perf buffers via v[zm]alloc().
>> >>
>> >> Either right after allocation, or right before making the vmalloc'd data
>> >> structure visible to the instrumentation. In the case of the pid filter,
>> >> that would be the rcu_assign_pointer() which publishes the new pid filter
>> >> table.
>> >>
>> >> As long as vmalloc_sync_mappings() is performed somewhere *between* allocation
>> >> and publishing the pointer for instrumentation, it's fine.
>> >>
>> >> I'll let Steven decide on which approach works best for him.
>> >
>> > As stated in the other email, I don't see it having anything to do with
>> > vmalloc, but with the per_cpu() allocation. I'll test this theory out by
>> > not even allocating the pid masks and touching the per cpu data at every
>> > event to see if it crashes.
>>
>> As pointed out in my other email, per-cpu allocation uses vmalloc when
>> size > PAGE_SIZE.
>
> And as I replied:
>
> buf->data = alloc_percpu(struct trace_array_cpu);
>
> struct trace_array_cpu {
> 	atomic_t	disabled;
> 	void		*buffer_page;	/* ring buffer spare */
>
> 	unsigned long	entries;
> 	unsigned long	saved_latency;
> 	unsigned long	critical_start;
> 	unsigned long	critical_end;
> 	unsigned long	critical_sequence;
> 	unsigned long	nice;
> 	unsigned long	policy;
> 	unsigned long	rt_priority;
> 	unsigned long	skipped_entries;
> 	u64		preempt_timestamp;
> 	pid_t		pid;
> 	kuid_t		uid;
> 	char		comm[TASK_COMM_LEN];
>
> 	bool		ignore_pid;
> #ifdef CONFIG_FUNCTION_TRACER
> 	bool		ftrace_ignore_pid;
> #endif
> };
>
> That doesn't look bigger than PAGE_SIZE to me.

Let me point you to pcpu_alloc() calling pcpu_create_chunk(), which is
then responsible for calling the underlying pcpu_mem_zalloc(), which in
turn uses vmalloc. So chunk batching can be responsible for using
vmalloc'd memory rather than kmalloc'd memory even though the individual
allocation size is smaller than 4kB.
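
In call-chain form, that path is roughly:

pcpu_alloc()                 /* even for a request <= PAGE_SIZE */
  pcpu_create_chunk()        /* when no existing chunk has room */
    pcpu_mem_zalloc(size)    /* chunk-sized, so size > PAGE_SIZE */
      __vmalloc(size, gfp | __GFP_ZERO, PAGE_KERNEL)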

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-04-30 19:16:37

by Joerg Roedel

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

On Thu, Apr 30, 2020 at 12:11:36PM -0400, Steven Rostedt wrote:
> bool trace_event_ignore_this_pid(struct trace_event_file *trace_file)
> {
> 	struct trace_array *tr = trace_file->tr;
> 	struct trace_array_cpu *data;
> 	struct trace_pid_list *no_pid_list;
> 	struct trace_pid_list *pid_list;
>
> 	pid_list = rcu_dereference_raw(tr->filtered_pids);
> 	no_pid_list = rcu_dereference_raw(tr->filtered_no_pids);
>
> 	if (!pid_list && !no_pid_list)
> 		return false;
>
> 	data = this_cpu_ptr(tr->array_buffer.data);
>
> 	return data->ignore_pid;
> }
>
> [...]
>
> tl;dr; It's not an issue with the vmalloc, it's an issue with per_cpu
> allocations!

Yes, it looks like you are right: it faults on the return statement, so
the 'data' pointer seems not to be mapped. I looked at another instance.

The trapping instruction is:

	movzbl 0x7c(%rax),%eax

which is a pointer dereference, with %rax=0xffffe8ffffccc870. The 'data'
pointer is allocated with alloc_percpu().

And alloc_percpu() calls down into pcpu_alloc(), which allocates new
percpu chunks using vmalloc() on x86. And there we are again in the
vmalloc area.

Regards,

Joerg

2020-05-01 01:14:58

by Steven Rostedt

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()


[ Joerg, sending again this time not just to you. (hit reply to sender
and not reply to all). Feel free to resend what you wrote before to this ]

On Thu, 30 Apr 2020 21:14:34 +0200
Joerg Roedel <[email protected]> wrote:

> And alloc_percpu() calls down into pcpu_alloc(), which allocates new
> percpu chunks using vmalloc() on x86. And there we are again in the
> vmalloc area.

So after a vmalloc() is made, should the page tables be synced?

This is a rather subtle bug, and I don't think it should be the caller of
alloc_percpu() that needs to call vmalloc_sync_mappings().

What's your suggestion for a fix?

-- Steve

2020-05-01 02:28:51

by Mathieu Desnoyers

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

----- On Apr 30, 2020, at 9:13 PM, rostedt [email protected] wrote:

> [ Joerg, sending again this time not just to you. (hit reply to sender
> and not reply to all). Feel free to resend what you wrote before to this ]
>
> On Thu, 30 Apr 2020 21:14:34 +0200
> Joerg Roedel <[email protected]> wrote:
>
>> And alloc_percpu() calls down into pcpu_alloc(), which allocates new
>> percpu chunks using vmalloc() on x86. And there we are again in the
>> vmalloc area.
>
> So after a vmalloc() is made, should the page tables be synced?

Why should it? Usually, the page fault handler is able to resolve the
resulting minor page faults lazily.

>
> > This is a rather subtle bug, and I don't think it should be the caller of
> > alloc_percpu() that needs to call vmalloc_sync_mappings().

Who said tracing was easy? ;-)

> What's your suggestion for a fix?

I know the question is not addressed to me, but here are my 2 cents:

It's subtle because ftrace is tracing the page fault handler through
tracepoints. It would not make sense to slow down all vmalloc() or
alloc_percpu() calls just because tracing recurses when tracing page
faults.

I think the right approach to solve this is to call vmalloc_sync_mappings()
before any vmalloc'd memory ends up being observable by instrumentation.
This can be achieved by adding a vmalloc_sync_mappings() call on tracepoint
registration like I proposed in my patchset a few weeks ago:

https://lore.kernel.org/r/[email protected]

The tracers just have to make sure they perform their vmalloc'd memory
allocation before registering the tracepoint which can touch it, else they
need to issue vmalloc_sync_mappings() on their own before making the
newly allocated memory observable by instrumentation.

This approach is not new: register_die_notifier() does exactly that today.
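
For reference, register_die_notifier() at the time did roughly this (a
sketch of what is in kernel/notifier.c):

int register_die_notifier(struct notifier_block *nb)
{
	vmalloc_sync_mappings();
	return atomic_notifier_chain_register(&die_chain, nb);
}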

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-05-01 02:41:10

by Steven Rostedt

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

On Thu, 30 Apr 2020 22:26:55 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:

> ----- On Apr 30, 2020, at 9:13 PM, rostedt [email protected] wrote:
>
> > [ Joerg, sending again this time not just to you. (hit reply to sender
> > and not reply to all). Feel free to resend what you wrote before to this ]
> >
> > On Thu, 30 Apr 2020 21:14:34 +0200
> > Joerg Roedel <[email protected]> wrote:
> >
> >> And alloc_percpu() calls down into pcpu_alloc(), which allocates new
> >> percpu chunks using vmalloc() on x86. And there we are again in the
> >> vmalloc area.
> >
> > So after a vmalloc() is made, should the page tables be synced?
>
> Why should it? Usually, the page fault handler is able to resolve the
> resulting minor page faults lazily.
>
> >
> > This is a rather subtle bug, and I don't think it should be the caller of
> > alloc_percpu() that needs to call vmalloc_sync_mappings().
>
> Who said tracing was easy? ;-)

But anyone can hook to a tracepoint, and if they hook to one that is in
the page fault handler and they use vmalloc, they can lock up the
machine.

>
> > What's your suggestion for a fix?
>
> I know the question is not addressed to me, but here are my 2 cents:
>
> It's subtle because ftrace is tracing the page fault handler through
> tracepoints. It would not make sense to slow down all vmalloc() or
> alloc_percpu() calls just because tracing recurses when tracing page
> faults.

What's so damn special about alloc_percpu()? It's definitely not a fast
path. And it's not used often.

>
> I think the right approach to solve this is to call vmalloc_sync_mappings()
> before any vmalloc'd memory ends up being observable by instrumentation.
> This can be achieved by adding a vmalloc_sync_mappings() call on tracepoint
> registration like I proposed in my patchset a few weeks ago:
>
> https://lore.kernel.org/r/[email protected]
>
> The tracers just have to make sure they perform their vmalloc'd memory
> allocation before registering the tracepoint which can touch it, else they
> need to issue vmalloc_sync_mappings() on their own before making the
> newly allocated memory observable by instrumentation.
>
> This approach is not new: register_die_notifier() does exactly that today.
>

I'll give the answer I gave to Joerg when he replied to my accidental
private (not public) email:

Or even my original patch would be better than having the generic tracing
code understand the intrinsic properties of vmalloc() and alloc_percpu()
on x86_64. I really don't think it is wise to have:

	foo = alloc_percpu();

	/*
	 * Because of some magic with the way alloc_percpu() works on
	 * x86_64, we need to synchronize the pgd of all the tables,
	 * otherwise the trace events that happen in x86_64 page fault
	 * handlers can't cope with the chance that alloc_percpu()'d
	 * memory might be touched in the page fault trace event. Oh,
	 * and we need to audit all alloc_percpu() and vmalloc() calls
	 * in tracing, because something might get triggered within a
	 * page fault trace event!
	 */
	vmalloc_sync_mappings();

That would be exactly what I would add as a comment if it were to be added
in the generic tracing code.

And we would need to audit any percpu alloc'd code in all of tracing, or
anything that might get hooked into something that hooks to the page fault
trace point.

Since this worked for a decade without this, I'm strongly against adding it
in the generic code due to some issues with a single architecture.

-- Steve

2020-05-01 04:22:12

by Steven Rostedt

Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

On Thu, 30 Apr 2020 22:26:55 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:

> The tracers just have to make sure they perform their vmalloc'd memory
> allocation before registering the tracepoint which can touch it, else they
> need to issue vmalloc_sync_mappings() on their own before making the
> newly allocated memory observable by instrumentation.

What gets me is that I added the patch below (which adds a
vmalloc_sync_mappings() just after the alloc_percpu()), but I also recorded
all instances of vmalloc() with a stackdump, and I get this:

colord-1673 [002] .... 84.764804: __vmalloc_node_range+0x5/0x2c0: vmalloc called here
colord-1673 [002] .... 84.764807: <stack trace>
=> __ftrace_trace_stack+0x161/0x1a0
=> __vmalloc_node_range+0x4d/0x2c0
=> module_alloc+0x7e/0xd0
=> bpf_jit_binary_alloc+0x70/0x110
=> bpf_int_jit_compile+0x139/0x40a
=> bpf_prog_select_runtime+0xa3/0x120
=> bpf_prepare_filter+0x533/0x5a0
=> sk_attach_filter+0x13/0x50
=> sock_setsockopt+0xd2f/0xf90
=> __sys_setsockopt+0x18a/0x1a0
=> __x64_sys_setsockopt+0x20/0x30
=> do_syscall_64+0x60/0x230
=> entry_SYSCALL_64_after_hwframe+0x49/0xb3
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0


[ the above is from before the tracing started ]

trace-cmd-1687 [002] .... 103.908850: __vmalloc_node_range+0x5/0x2c0: vmalloc called here
trace-cmd-1687 [002] .... 103.908856: <stack trace>
=> __ftrace_trace_stack+0x161/0x1a0
=> __vmalloc_node_range+0x4d/0x2c0
=> vzalloc+0x48/0x50
=> trace_pid_write+0x23d/0x2b0
=> pid_write.isra.62+0xd1/0x2f0
=> vfs_write+0xa8/0x1b0
=> ksys_write+0x67/0xe0
=> do_syscall_64+0x60/0x230
=> entry_SYSCALL_64_after_hwframe+0x49/0xb3
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0
trace-cmd-1697 [003] .... 104.088950: __vmalloc_node_range+0x5/0x2c0: vmalloc called here
trace-cmd-1697 [003] .... 104.088954: <stack trace>
=> __ftrace_trace_stack+0x161/0x1a0
=> __vmalloc_node_range+0x4d/0x2c0
=> vzalloc+0x48/0x50
=> trace_pid_write+0x23d/0x2b0
=> pid_write.isra.62+0xd1/0x2f0
=> vfs_write+0xa8/0x1b0
=> ksys_write+0x67/0xe0
=> do_syscall_64+0x60/0x230
=> entry_SYSCALL_64_after_hwframe+0x49/0xb3
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0
trace-cmd-1697 [003] .... 104.089666: __vmalloc_node_range+0x5/0x2c0: vmalloc called here
trace-cmd-1697 [003] .... 104.089669: <stack trace>
=> __ftrace_trace_stack+0x161/0x1a0
=> __vmalloc_node_range+0x4d/0x2c0
=> vzalloc+0x48/0x50
=> trace_pid_write+0xc1/0x2b0
=> pid_write.isra.62+0xd1/0x2f0
=> vfs_write+0xa8/0x1b0
=> ksys_write+0x67/0xe0
=> do_syscall_64+0x60/0x230
=> entry_SYSCALL_64_after_hwframe+0x49/0xb3
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0
trace-cmd-1697 [003] .... 104.098920: __vmalloc_node_range+0x5/0x2c0: vmalloc called here
trace-cmd-1697 [003] .... 104.098924: <stack trace>
=> __ftrace_trace_stack+0x161/0x1a0
=> __vmalloc_node_range+0x4d/0x2c0
=> vzalloc+0x48/0x50
=> trace_pid_write+0xc1/0x2b0
=> pid_write.isra.62+0xd1/0x2f0
=> vfs_write+0xa8/0x1b0
=> ksys_write+0x67/0xe0
=> do_syscall_64+0x60/0x230
=> entry_SYSCALL_64_after_hwframe+0x49/0xb3
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0
trace-cmd-1697 [003] .... 104.114518: __vmalloc_node_range+0x5/0x2c0: vmalloc called here
trace-cmd-1697 [003] .... 104.114520: <stack trace>
=> __ftrace_trace_stack+0x161/0x1a0
=> __vmalloc_node_range+0x4d/0x2c0
=> vzalloc+0x48/0x50
=> trace_pid_write+0xc1/0x2b0
=> pid_write.isra.62+0xd1/0x2f0
=> vfs_write+0xa8/0x1b0
=> ksys_write+0x67/0xe0
=> do_syscall_64+0x60/0x230
=> entry_SYSCALL_64_after_hwframe+0x49/0xb3
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0
trace-cmd-1697 [003] .... 104.130705: __vmalloc_node_range+0x5/0x2c0: vmalloc called here
trace-cmd-1697 [003] .... 104.130707: <stack trace>
=> __ftrace_trace_stack+0x161/0x1a0
=> __vmalloc_node_range+0x4d/0x2c0
=> vzalloc+0x48/0x50
=> trace_pid_write+0x23d/0x2b0
=> event_pid_write.isra.30+0x21b/0x3b0
=> vfs_write+0xa8/0x1b0
=> ksys_write+0x67/0xe0
=> do_syscall_64+0x60/0x230
=> entry_SYSCALL_64_after_hwframe+0x49/0xb3
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0
trace-cmd-1687 [001] .... 106.000510: __vmalloc_node_range+0x5/0x2c0: vmalloc called here
trace-cmd-1687 [001] .... 106.000514: <stack trace>
=> __ftrace_trace_stack+0x161/0x1a0
=> __vmalloc_node_range+0x4d/0x2c0
=> vzalloc+0x48/0x50
=> trace_pid_write+0x23d/0x2b0
=> pid_write.isra.62+0xd1/0x2f0
=> vfs_write+0xa8/0x1b0
=> ksys_write+0x67/0xe0
=> do_syscall_64+0x60/0x230
=> entry_SYSCALL_64_after_hwframe+0x49/0xb3
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0
=> 0

The above shows the calls that add PIDs to set_event_pid. (I see I should
probably make that code a bit more efficient; it calls into the vmalloc
code a bit too much).

But what is missing is the call to vmalloc from alloc_percpu(). In fact, I
put printks in the vmalloc() that's in alloc_percpu() and it doesn't
trigger from the tracing code, but it does show up in my trace from other
areas of the kernel:

kworker/1:3-204 [001] .... 42.888340: __vmalloc_node_range+0x5/0x2c0: vmalloc called here
kworker/1:3-204 [001] .... 42.888342: <stack trace>
=> __ftrace_trace_stack+0x161/0x1a0
=> __vmalloc_node_range+0x4d/0x2c0
=> __vmalloc+0x30/0x40
=> pcpu_create_chunk+0x77/0x220
=> pcpu_balance_workfn+0x407/0x650
=> process_one_work+0x25e/0x5c0
=> worker_thread+0x30/0x380
=> kthread+0x139/0x160
=> ret_from_fork+0x3a/0x50

So I'm still not 100% sure why the percpu data is causing a problem?

-- Steve

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8d2b98812625..10e4970a150c 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8486,6 +8486,7 @@ allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, int size
return -ENOMEM;

buf->data = alloc_percpu(struct trace_array_cpu);
+ vmalloc_sync_mappings();
if (!buf->data) {
ring_buffer_free(buf->buffer);
buf->buffer = NULL;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 9a8227afa073..489cf0620edc 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2543,6 +2543,8 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
void *addr;
unsigned long real_size = size;

+ trace_printk("vmalloc called here\n");
+ trace_dump_stack(0);
size = PAGE_ALIGN(size);
if (!size || (size >> PAGE_SHIFT) > totalram_pages())
goto fail;

2020-05-01 10:20:05

by Joerg Roedel

[permalink] [raw]
Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

On Thu, Apr 30, 2020 at 10:39:19PM -0400, Steven Rostedt wrote:
> I'll give the answer I gave to Joerg when he replied to my accidental
> private (not public) email:
>
> Or even my original patch would be better than having the generic tracing
> code understand the intrinsic properties of vmalloc() and
> alloc_percpu() on x86_64. I really don't think it is wise to have:
>
> foo = alloc_percpu();
>
> /*
> * Because of some magic with the way alloc_percpu() works on
> * x86_64, we need to synchronize the pgd of all the tables,
> * otherwise the trace events that happen in x86_64 page fault
> * handlers can't cope with accessing the chance that a
> * alloc_percpu()'d memory might be touched in the page fault trace
> * event. Oh, and we need to audit all alloc_percpu() and vmalloc()
> * calls in tracing, because something might get triggered within a
> * page fault trace event!
> */
> vmalloc_sync_mappings();
>
> That would be exactly what I add as a comment if it were to be added in the
> generic tracing code.
>
> And we would need to audit any percpu alloc'd code in all tracing, or
> anything that might get hooked into something that hooks to the page fault
> trace point.
>
> Since this worked for a decade without this, I'm strongly against adding it
> in the generic code due to some issues with a single architecture.

That is exactly the problem with vmalloc_sync_mappings()/unmappings().
It is not at all clear when it needs to be called and why, or even who
is responsible for calling it. The existing call-sites in Notifier
and ACPI code have no comment on why it is necessary to synchronize the
vmalloc mappings there.

It is only needed on x86, and we could also get rid of it completely if:

1) At x86-64 we pre-allocate all 64 P4D/PUD pages for the
vmalloc area in init_mm at boot time. This needs 256kb of
memory per system, most of it potentially unused as each
P4D/PUD maps 512GB of address space.

2) At x86-32 we need to disable large pages for vmalloc/ioremap
mappings and pre-allocate the PTE pages for the vmalloc area
in init_mm. Depending on how much memory the system has and
the configured kernel/user split this might take more than 64
pages.

With that we could get rid of the vmalloc_sync interface and also the
vmalloc-fault code in general and reduce complexity. This interface
has caused problems more than once. On the other hand, it would trade
memory usage for that simplification.
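
A rough sketch of what option 1) could look like on x86-64. This is purely
illustrative (names and error handling here are made up for the sketch),
though something along these lines was later merged upstream as
preallocate_vmalloc_pages():

static void __init preallocate_vmalloc_pages(void)
{
        unsigned long addr;

        /* Walk the vmalloc range one PGD entry at a time. */
        for (addr = VMALLOC_START; addr <= VMALLOC_END;
             addr = ALIGN(addr + 1, PGDIR_SIZE)) {
                pgd_t *pgd = pgd_offset_k(addr);
                p4d_t *p4d;
                pud_t *pud;

                /* Populate the P4D level in init_mm. */
                p4d = p4d_alloc(&init_mm, pgd, addr);
                if (!p4d)
                        panic("vmalloc: cannot pre-allocate p4d\n");

                if (pgtable_l5_enabled())
                        continue;

                /* With 4-level paging, also populate the PUD level. */
                pud = pud_alloc(&init_mm, p4d, addr);
                if (!pud)
                        panic("vmalloc: cannot pre-allocate pud\n");
        }
}

Once all 64 top-level entries exist in init_mm before any other PGD is
cloned from it, there is nothing left to synchronize lazily.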

Regards,

Joerg

2020-05-01 13:25:25

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

----- On May 1, 2020, at 12:20 AM, rostedt [email protected] wrote:

> On Thu, 30 Apr 2020 22:26:55 -0400 (EDT)
> Mathieu Desnoyers <[email protected]> wrote:
>
>> The tracers just have to make sure they perform their vmalloc'd memory
>> allocation before registering the tracepoint which can touch it, else they
>> need to issue vmalloc_sync_mappings() on their own before making the
>> newly allocated memory observable by instrumentation.
>
> What gets me is that I added the patch below (which adds a
> vmalloc_sync_mappings() just after the alloc_percpu()), but I also recorded
> all instances of vmalloc() with a stackdump, and I get this:
>
> colord-1673 [002] .... 84.764804: __vmalloc_node_range+0x5/0x2c0: vmalloc
> called here
> colord-1673 [002] .... 84.764807: <stack trace>
> => __ftrace_trace_stack+0x161/0x1a0
> => __vmalloc_node_range+0x4d/0x2c0
> => module_alloc+0x7e/0xd0
> => bpf_jit_binary_alloc+0x70/0x110
> => bpf_int_jit_compile+0x139/0x40a
> => bpf_prog_select_runtime+0xa3/0x120
> => bpf_prepare_filter+0x533/0x5a0
> => sk_attach_filter+0x13/0x50
> => sock_setsockopt+0xd2f/0xf90
> => __sys_setsockopt+0x18a/0x1a0
> => __x64_sys_setsockopt+0x20/0x30
> => do_syscall_64+0x60/0x230
> => entry_SYSCALL_64_after_hwframe+0x49/0xb3
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
>
>
> [ the above is from before the tracing started ]
>
> trace-cmd-1687 [002] .... 103.908850: __vmalloc_node_range+0x5/0x2c0: vmalloc
> called here
> trace-cmd-1687 [002] .... 103.908856: <stack trace>
> => __ftrace_trace_stack+0x161/0x1a0
> => __vmalloc_node_range+0x4d/0x2c0
> => vzalloc+0x48/0x50
> => trace_pid_write+0x23d/0x2b0
> => pid_write.isra.62+0xd1/0x2f0
> => vfs_write+0xa8/0x1b0
> => ksys_write+0x67/0xe0
> => do_syscall_64+0x60/0x230
> => entry_SYSCALL_64_after_hwframe+0x49/0xb3
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
> trace-cmd-1697 [003] .... 104.088950: __vmalloc_node_range+0x5/0x2c0: vmalloc
> called here
> trace-cmd-1697 [003] .... 104.088954: <stack trace>
> => __ftrace_trace_stack+0x161/0x1a0
> => __vmalloc_node_range+0x4d/0x2c0
> => vzalloc+0x48/0x50
> => trace_pid_write+0x23d/0x2b0
> => pid_write.isra.62+0xd1/0x2f0
> => vfs_write+0xa8/0x1b0
> => ksys_write+0x67/0xe0
> => do_syscall_64+0x60/0x230
> => entry_SYSCALL_64_after_hwframe+0x49/0xb3
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
> trace-cmd-1697 [003] .... 104.089666: __vmalloc_node_range+0x5/0x2c0: vmalloc
> called here
> trace-cmd-1697 [003] .... 104.089669: <stack trace>
> => __ftrace_trace_stack+0x161/0x1a0
> => __vmalloc_node_range+0x4d/0x2c0
> => vzalloc+0x48/0x50
> => trace_pid_write+0xc1/0x2b0
> => pid_write.isra.62+0xd1/0x2f0
> => vfs_write+0xa8/0x1b0
> => ksys_write+0x67/0xe0
> => do_syscall_64+0x60/0x230
> => entry_SYSCALL_64_after_hwframe+0x49/0xb3
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
> trace-cmd-1697 [003] .... 104.098920: __vmalloc_node_range+0x5/0x2c0: vmalloc
> called here
> trace-cmd-1697 [003] .... 104.098924: <stack trace>
> => __ftrace_trace_stack+0x161/0x1a0
> => __vmalloc_node_range+0x4d/0x2c0
> => vzalloc+0x48/0x50
> => trace_pid_write+0xc1/0x2b0
> => pid_write.isra.62+0xd1/0x2f0
> => vfs_write+0xa8/0x1b0
> => ksys_write+0x67/0xe0
> => do_syscall_64+0x60/0x230
> => entry_SYSCALL_64_after_hwframe+0x49/0xb3
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
> trace-cmd-1697 [003] .... 104.114518: __vmalloc_node_range+0x5/0x2c0: vmalloc
> called here
> trace-cmd-1697 [003] .... 104.114520: <stack trace>
> => __ftrace_trace_stack+0x161/0x1a0
> => __vmalloc_node_range+0x4d/0x2c0
> => vzalloc+0x48/0x50
> => trace_pid_write+0xc1/0x2b0
> => pid_write.isra.62+0xd1/0x2f0
> => vfs_write+0xa8/0x1b0
> => ksys_write+0x67/0xe0
> => do_syscall_64+0x60/0x230
> => entry_SYSCALL_64_after_hwframe+0x49/0xb3
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
> trace-cmd-1697 [003] .... 104.130705: __vmalloc_node_range+0x5/0x2c0: vmalloc
> called here
> trace-cmd-1697 [003] .... 104.130707: <stack trace>
> => __ftrace_trace_stack+0x161/0x1a0
> => __vmalloc_node_range+0x4d/0x2c0
> => vzalloc+0x48/0x50
> => trace_pid_write+0x23d/0x2b0
> => event_pid_write.isra.30+0x21b/0x3b0
> => vfs_write+0xa8/0x1b0
> => ksys_write+0x67/0xe0
> => do_syscall_64+0x60/0x230
> => entry_SYSCALL_64_after_hwframe+0x49/0xb3
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
> trace-cmd-1687 [001] .... 106.000510: __vmalloc_node_range+0x5/0x2c0: vmalloc
> called here
> trace-cmd-1687 [001] .... 106.000514: <stack trace>
> => __ftrace_trace_stack+0x161/0x1a0
> => __vmalloc_node_range+0x4d/0x2c0
> => vzalloc+0x48/0x50
> => trace_pid_write+0x23d/0x2b0
> => pid_write.isra.62+0xd1/0x2f0
> => vfs_write+0xa8/0x1b0
> => ksys_write+0x67/0xe0
> => do_syscall_64+0x60/0x230
> => entry_SYSCALL_64_after_hwframe+0x49/0xb3
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
> => 0
>
> The above shows the calls that add PIDs to set_event_pid. (I see I should
> probably make that code a bit more efficient; it calls into the vmalloc
> code a bit too much).
>
> But what is missing is the call to vmalloc from alloc_percpu(). In fact, I
> put printks in the vmalloc() that's in alloc_percpu() and it doesn't
> trigger from the tracing code, but it does show up in my trace from other
> areas of the kernel:
>
> kworker/1:3-204 [001] .... 42.888340: __vmalloc_node_range+0x5/0x2c0:
> vmalloc called here
> kworker/1:3-204 [001] .... 42.888342: <stack trace>
> => __ftrace_trace_stack+0x161/0x1a0
> => __vmalloc_node_range+0x4d/0x2c0
> => __vmalloc+0x30/0x40
> => pcpu_create_chunk+0x77/0x220
> => pcpu_balance_workfn+0x407/0x650
> => process_one_work+0x25e/0x5c0
> => worker_thread+0x30/0x380
> => kthread+0x139/0x160
> => ret_from_fork+0x3a/0x50
>
> So I'm still not 100% sure why the percpu data is causing a problem?

I suspect that this is simply because alloc_percpu() is calling __vmalloc()
to allocate a "chunk" before you even started tracing, possibly early at
boot. Your own alloc_percpu() allocation then happens to fit in an
already-vmalloc'd area which still has free space.
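
As a toy illustration of that (entirely hypothetical userspace code, not
the real percpu allocator), a chunk whose backing memory is mapped once,
early, and then merely carved up by later allocations:

#include <stdio.h>
#include <stddef.h>

#define CHUNK_SIZE 4096

/* Stands in for the chunk that __vmalloc() created long before tracing. */
static char chunk[CHUNK_SIZE];
static size_t used;

static void *toy_pcpu_alloc(size_t size)
{
        if (used + size > CHUNK_SIZE)
                return NULL;    /* real code would vmalloc a new chunk */
        void *p = chunk + used;
        used += size;
        return p;               /* no new mapping for this allocation */
}

int main(void)
{
        /* The allocation reuses the early mapping, so no vmalloc call
         * shows up in a trace taken after boot. */
        printf("alloc -> %p\n", toy_pcpu_alloc(128));
        return 0;
}

So the mapping that needs syncing can predate, by a long time, the
allocation that finally trips over it.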

Thanks,

Mathieu

>
> -- Steve
>
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 8d2b98812625..10e4970a150c 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -8486,6 +8486,7 @@ allocate_trace_buffer(struct trace_array *tr, struct
> array_buffer *buf, int size
> return -ENOMEM;
>
> buf->data = alloc_percpu(struct trace_array_cpu);
> + vmalloc_sync_mappings();
> if (!buf->data) {
> ring_buffer_free(buf->buffer);
> buf->buffer = NULL;
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 9a8227afa073..489cf0620edc 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2543,6 +2543,8 @@ void *__vmalloc_node_range(unsigned long size, unsigned
> long align,
> void *addr;
> unsigned long real_size = size;
>
> + trace_printk("vmalloc called here\n");
> + trace_dump_stack(0);
> size = PAGE_ALIGN(size);
> if (!size || (size >> PAGE_SHIFT) > totalram_pages())
> goto fail;

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-05-01 13:38:19

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

----- On Apr 30, 2020, at 10:39 PM, rostedt [email protected] wrote:

> On Thu, 30 Apr 2020 22:26:55 -0400 (EDT)
> Mathieu Desnoyers <[email protected]> wrote:
>
>> ----- On Apr 30, 2020, at 9:13 PM, rostedt [email protected] wrote:
>>
>> > [ Joerg, sending again this time not just to you. (hit reply to sender
>> > and not reply to all). Feel free to resend what you wrote before to this ]
>> >
>> > On Thu, 30 Apr 2020 21:14:34 +0200
>> > Joerg Roedel <[email protected]> wrote:
>> >
>> >> And alloc_percpu() calls down into pcpu_alloc(), which allocates new
>> >> percpu chunks using vmalloc() on x86. And there we are again in the
>> >> vmalloc area.
>> >
>> > So after a vmalloc() is made, should the page tables be synced?
>>
>> Why should it? Usually, the page fault handler is able to resolve the
>> resulting minor page faults lazily.
>>
>> >
>> > This is a rather subtle bug, and I don't think it should be the caller of
>> > percpu_alloc() that needs to call vmalloc_sync_mappings().
>>
>> Who said tracing was easy? ;-)
>
> But anyone can hook into a tracepoint, and then if they hook into one that
> is in the page fault handler and they use vmalloc, they can lock up the
> machine.

True.

>
>>
>> > What's your suggestion for a fix?
>>
>> I know the question is not addressed to me, but here are my 2 cents:
>>
>> It's subtle because ftrace is tracing the page fault handler through
>> tracepoints. It would not make sense to slow down all vmalloc or
>> percpu_alloc() just because tracing recurses when tracing page faults.
>
> What's so damn special about alloc_percpu()? It's definitely not a fast
> path. And it's not used often.

If it's not a fast path, then we might want to do this in pcpu_mem_zalloc(),
which is used to allocate the percpu chunks:

static void *pcpu_mem_zalloc(size_t size, gfp_t gfp)
{
        if (WARN_ON_ONCE(!slab_is_available()))
                return NULL;

        if (size <= PAGE_SIZE) {
                return kzalloc(size, gfp);
        } else {
                void *p = __vmalloc(size, gfp | __GFP_ZERO, PAGE_KERNEL);
                vmalloc_sync_mappings();
                return p;
        }
}

However this would not solve the entire problem for vmap(), vmalloc() and
friends.

>
>>
>> I think the right approach to solve this is to call vmalloc_sync_mappings()
>> before any vmalloc'd memory ends up being observable by instrumentation.
>> This can be achieved by adding a vmalloc_sync_mappings call on tracepoint
>> registration, like I proposed in my patchset a few weeks ago:
>>
>> https://lore.kernel.org/r/[email protected]
>>
>> The tracers just have to make sure they perform their vmalloc'd memory
>> allocation before registering the tracepoint which can touch it, else they
>> need to issue vmalloc_sync_mappings() on their own before making the
>> newly allocated memory observable by instrumentation.
>>
>> This approach is not new: register_die_notifier() does exactly that today.
>>
>
> I'll give the answer I gave to Joerg when he replied to my accidental
> private (not public) email:
>
> Or even my original patch would be better than having the generic tracing
> code understand the intrinsic properties of vmalloc() and
> alloc_percpu() on x86_64. I really don't think it is wise to have:
>
> foo = alloc_percpu();
>
> /*
> * Because of some magic with the way alloc_percpu() works on
> * x86_64, we need to synchronize the pgd of all the tables,
> * otherwise the trace events that happen in x86_64 page fault
> * handlers can't cope with accessing the chance that a
> * alloc_percpu()'d memory might be touched in the page fault trace
> * event. Oh, and we need to audit all alloc_percpu() and vmalloc()
> * calls in tracing, because something might get triggered within a
> * page fault trace event!
> */
> vmalloc_sync_mappings();
>
> That would be exactly what I add as a comment if it were to be added in the
> generic tracing code.

Yeah, I myself have sprinkled vmalloc_sync_all() for years in LTTng to handle
these corner-cases. Note that module memory is vmalloc'd as well!

>
> And we would need to audit any percpu alloc'd code in all tracing, or
> anything that might get hooked into something that hooks to the page fault
> trace point.

Indeed, tracing the page fault handler is really tricky exactly because
of minor faults vs vmalloc'd memory interactions.

>
> Since this worked for a decade without this, I'm strongly against adding it
> in the generic code due to some issues with a single architecture.

I'm really OK with making all this more robust. I think it's however important
that we are careful not to slow down or add extra memory overhead to
existing workloads as we do this, especially if the only motivation is
to make tracing marginally simpler.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-05-04 15:30:02

by Joerg Roedel

[permalink] [raw]
Subject: [PATCH] percpu: Sync vmalloc mappings in pcpu_alloc() and free_percpu()

On Thu, Apr 30, 2020 at 10:39:19PM -0400, Steven Rostedt wrote:
> What's so damn special about alloc_percpu()? It's definitely not a fast
> path. And it's not used often.

Okay, I fixed it in the percpu code. It is definitely not a nice
solution, but having to call vmalloc_sync_mappings/unmappings() is not a
nice solution at any place in the code. Here is the patch which fixes
this issue for me. I am also not sure what to put in the Fixes tag, as
it is related to tracing code accessing per-cpu data from the page-fault
handler, and I am not sure when this got introduced. Maybe someone else can
provide a meaningful Fixes- or stable tag.

I also have an idea in mind how to make this all more robust and get rid
of the vmalloc_sync_mappings/unmappings() interface, will show more when
I know it works the way I think it does.

Regards,

Joerg

From c616a9a09499f9c9d682775767d3de7db81fb2ed Mon Sep 17 00:00:00 2001
From: Joerg Roedel <[email protected]>
Date: Mon, 4 May 2020 17:11:41 +0200
Subject: [PATCH] percpu: Sync vmalloc mappings in pcpu_alloc() and
free_percpu()

Sync the vmalloc mappings for all page-tables in the system when
allocating and freeing per-cpu memory. This is necessary for
architectures which use page-faults on vmalloc areas.

The page-fault handler accesses per-cpu data when tracing is enabled,
and faulting again in the page-fault handler on a vmalloc'ed per-cpu area
will result in a recursive fault.

Signed-off-by: Joerg Roedel <[email protected]>
---
mm/percpu.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/mm/percpu.c b/mm/percpu.c
index d7e3bc649f4e..6ab035bc6977 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1710,6 +1710,20 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
trace_percpu_alloc_percpu(reserved, is_atomic, size, align,
chunk->base_addr, off, ptr);

+ /*
+ * The per-cpu buffers might be allocated in the vmalloc area of the
+ * address space. When the architecture allows faulting on the vmalloc
+ * area and the memory allocated here is accessed in the page-fault
+ * handler, the vmalloc area fault may be recursive and could never be
+ * resolved.
+ * This happens for example in the tracing code which allocates per-cpu
+ * and accesses them when tracing page-faults.
+ * To prevent this, make sure the per-cpu buffers allocated here are
+ * mapped in all PGDs so that the page-fault handler will never fault
+ * again on them.
+ */
+ vmalloc_sync_mappings();
+
return ptr;

fail_unlock:
@@ -1958,6 +1972,12 @@ void free_percpu(void __percpu *ptr)

trace_percpu_free_percpu(chunk->base_addr, off, ptr);

+ /*
+ * See comment at the vmalloc_sync_mappings() call in pcpu_alloc() for
+ * why this is necessary.
+ */
+ vmalloc_sync_unmappings();
+
spin_unlock_irqrestore(&pcpu_lock, flags);

if (need_balance)
--
2.12.3

2020-05-04 15:36:24

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH] percpu: Sync vmalloc mappings in pcpu_alloc() and free_percpu()

----- On May 4, 2020, at 11:12 AM, Joerg Roedel [email protected] wrote:

> On Thu, Apr 30, 2020 at 10:39:19PM -0400, Steven Rostedt wrote:
>> What's so damn special about alloc_percpu()? It's definitely not a fast
>> path. And it's not used often.
>
> Okay, I fixed it in the percpu code. It is definitely not a nice
> solution, but having to call vmalloc_sync_mappings/unmappings() is not a
> nice solution at any place in the code. Here is the patch which fixes
> this issue for me. I am also not sure what to put in the Fixes tag, as
> it is related to tracing code accessing per-cpu data from the page-fault
> handler, and I am not sure when this got introduced. Maybe someone else can
> provide a meaningful Fixes- or stable tag.
>
> I also have an idea in mind how to make this all more robust and get rid
> of the vmalloc_sync_mappings/unmappings() interface, will show more when
> I know it works the way I think it does.
>
> Regards,
>
> Joerg
>
> From c616a9a09499f9c9d682775767d3de7db81fb2ed Mon Sep 17 00:00:00 2001
> From: Joerg Roedel <[email protected]>
> Date: Mon, 4 May 2020 17:11:41 +0200
> Subject: [PATCH] percpu: Sync vmalloc mappings in pcpu_alloc() and
> free_percpu()
>
> Sync the vmalloc mappings for all page-tables in the system when
> allocating and freeing per-cpu memory. This is necessary for
> architectures which use page-faults on vmalloc areas.
>
> The page-fault handler accesses per-cpu data when tracing is enabled,
> and faulting again in the page-fault handler on a vmalloc'ed per-cpu area
> will result in a recursive fault.
>
> Signed-off-by: Joerg Roedel <[email protected]>
> ---
> mm/percpu.c | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)
>
> diff --git a/mm/percpu.c b/mm/percpu.c
> index d7e3bc649f4e..6ab035bc6977 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1710,6 +1710,20 @@ static void __percpu *pcpu_alloc(size_t size, size_t
> align, bool reserved,
> trace_percpu_alloc_percpu(reserved, is_atomic, size, align,
> chunk->base_addr, off, ptr);
>
> + /*
> + * The per-cpu buffers might be allocated in the vmalloc area of the
> + * address space. When the architecture allows faulting on the vmalloc
> + * area and the memory allocated here is accessed in the page-fault
> + * handler, the vmalloc area fault may be recursive and could never be
> + * resolved.
> + * This happens for example in the tracing code which allocates per-cpu
> + * and accesses them when tracing page-faults.
> + * To prevent this, make sure the per-cpu buffers allocated here are
> + * mapped in all PGDs so that the page-fault handler will never fault
> + * again on them.
> + */
> + vmalloc_sync_mappings();

Placing this here is inefficient. It syncs mappings for each percpu allocation.
I would recommend moving it right after __vmalloc() is called to allocate the
underlying memory chunk instead:

static void *pcpu_mem_zalloc(size_t size, gfp_t gfp)
{
        if (WARN_ON_ONCE(!slab_is_available()))
                return NULL;

        if (size <= PAGE_SIZE)
                return kzalloc(size, gfp);
        else {
                void *p = __vmalloc(size, gfp | __GFP_ZERO, PAGE_KERNEL);
                /* Add comments here ... */
                vmalloc_sync_mappings();
                return p;
        }
}

> +
> return ptr;
>
> fail_unlock:
> @@ -1958,6 +1972,12 @@ void free_percpu(void __percpu *ptr)
>
> trace_percpu_free_percpu(chunk->base_addr, off, ptr);
>
> + /*
> + * See comment at the vmalloc_sync_mappings() call in pcpu_alloc() for
> + * why this is necessary.
> + */
> + vmalloc_sync_unmappings();

I wonder why we'd ever need to explicitly invoke vmalloc_sync_unmappings().
Leaving a stale PTE mapping in place to be lazily unmapped does not seem to
hurt even the tracing use-cases. Why add this call to vmalloc_sync_unmappings()
at all?

*If* this ends up being needed, it should be moved to:

static void pcpu_mem_free(void *ptr)
{
        /* Add comments here... */
        if (is_vmalloc_addr(ptr))
                vmalloc_sync_unmappings();
        kvfree(ptr);
}

So it is only called before the underlying vmalloc'd chunk is freed, rather than
at each and every percpu free.

Thanks,

Mathieu


> +
> spin_unlock_irqrestore(&pcpu_lock, flags);
>
> if (need_balance)
> --
> 2.12.3

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-05-04 15:38:54

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH] percpu: Sync vmalloc mappings in pcpu_alloc() and free_percpu()

On Mon, May 04, 2020 at 11:28:46AM -0400, Mathieu Desnoyers wrote:
> ----- On May 4, 2020, at 11:12 AM, Joerg Roedel [email protected] wrote:
> Placing this here is inefficient. It syncs mappings for each percpu allocation.
> I would recommend moving it right after __vmalloc() is called to allocate the
> underlying memory chunk instead:
>
> static void *pcpu_mem_zalloc(size_t size, gfp_t gfp)
> {

Tried this before, actually I put it into the caller of
pcpu_mem_zalloc(), but that didn't fix the problem for me. Steven's
test-case still hangs the machine.


Regards,

Joerg

2020-05-04 15:43:31

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH] percpu: Sync vmalloc mappings in pcpu_alloc() and free_percpu()

----- On May 4, 2020, at 11:31 AM, Joerg Roedel [email protected] wrote:

> On Mon, May 04, 2020 at 11:28:46AM -0400, Mathieu Desnoyers wrote:
>> ----- On May 4, 2020, at 11:12 AM, Joerg Roedel [email protected] wrote:
>> Placing this here is inefficient. It syncs mappings for each percpu allocation.
>> I would recommend moving it right after __vmalloc() is called to allocate the
>> underlying memory chunk instead:
>>
>> static void *pcpu_mem_zalloc(size_t size, gfp_t gfp)
>> {
>
> Tried this before, actually I put it into the caller of
> pcpu_mem_zalloc(), but that didn't fix the problem for me. Steven's
> test-case still hangs the machine.

That's unexpected.

Did you confirm that those hangs were also caused by percpu allocations?

Maybe adding the vmalloc_sync_mappings() at each percpu allocation happens
to luckily sync mappings after some other vmalloc.

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-05-04 17:10:00

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] percpu: Sync vmalloc mappings in pcpu_alloc() and free_percpu()

On Mon, 4 May 2020 11:38:43 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:

> ----- On May 4, 2020, at 11:31 AM, Joerg Roedel [email protected] wrote:
>
> > On Mon, May 04, 2020 at 11:28:46AM -0400, Mathieu Desnoyers wrote:
> >> ----- On May 4, 2020, at 11:12 AM, Joerg Roedel [email protected] wrote:
> >> Placing this here is inefficient. It syncs mappings for each percpu allocation.
> >> I would recommend moving it right after __vmalloc() is called to allocate the
> >> underlying memory chunk instead:
> >>
> >> static void *pcpu_mem_zalloc(size_t size, gfp_t gfp)
> >> {
> >
> > Tried this before, actually I put it into the caller of
> > pcpu_mem_zalloc(), but that didn't fix the problem for me. Stevens
> > test-case still hangs the machine.
>
> That's unexpected.
>
> Did you confirm that those hangs were also caused by percpu allocations?
>
> Maybe adding the vmalloc_sync_mappings() at each percpu allocation happens
> to luckily sync mappings after some other vmalloc.
>

It doesn't surprise me because my alloc_percpu() call never gets to that
path. But systemd does hit it for me earlier on.

-- Steve

2020-05-04 17:44:03

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] percpu: Sync vmalloc mappings in pcpu_alloc() and free_percpu()

On Mon, 4 May 2020 17:12:36 +0200
Joerg Roedel <[email protected]> wrote:

> On Thu, Apr 30, 2020 at 10:39:19PM -0400, Steven Rostedt wrote:
> > What's so damn special about alloc_percpu()? It's definitely not a fast
> > path. And it's not used often.
>
> Okay, I fixed it in the percpu code. It is definitely not a nice
> solution, but having to call vmalloc_sync_mappings/unmappings() is not a
> nice solution at any place in the code. Here is the patch which fixes
> this issue for me. I am also not sure what to put in the Fixes tag, as
> it is related to tracing code accessing per-cpu data from the page-fault
> handler, and I am not sure when this got introduced. Maybe someone else can
> provide a meaningful Fixes- or stable tag.
>
> I also have an idea in mind how to make this all more robust and get rid
> of the vmalloc_sync_mappings/unmappings() interface, will show more when
> I know it works the way I think it does.
>
>

Seems that your patch caused a lockdep splat on my box:

========================================================
WARNING: possible irq lock inversion dependency detected
5.7.0-rc3-test+ #249 Not tainted
--------------------------------------------------------
swapper/4/0 just changed the state of lock:
ffff9a580fdd75a0 (&ndev->lock){++.-}-{2:2}, at: mld_ifc_timer_expire+0x3c/0x350
but this lock took another, SOFTIRQ-unsafe lock in the past:
(pgd_lock){+.+.}-{2:2}


and interrupts could create inverse lock ordering between them.


other info that might help us debug this:
Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(pgd_lock);
                               local_irq_disable();
                               lock(&ndev->lock);
                               lock(pgd_lock);
  <Interrupt>
    lock(&ndev->lock);

*** DEADLOCK ***

1 lock held by swapper/4/0:
#0: ffff9a581ab05e70 ((&idev->mc_ifc_timer)){+.-.}-{0:0}, at: call_timer_fn+0x5/0x2f0

the shortest dependencies between 2nd lock and 1st lock:
-> (pgd_lock){+.+.}-{2:2} {
HARDIRQ-ON-W at:
lock_acquire+0xda/0x3d0
_raw_spin_lock+0x2f/0x40
sync_global_pgds_l4+0x77/0x180
pcpu_alloc+0x1fd/0x7b0
__kmem_cache_create+0x358/0x540
create_cache+0xe1/0x1f0
kmem_cache_create_usercopy+0x1a5/0x270
kmem_cache_create+0x12/0x20
acpi_os_create_cache+0x18/0x30
acpi_ut_create_caches+0x47/0xab
acpi_ut_init_globals+0xa/0x21a
acpi_initialize_subsystem+0x30/0xa5
acpi_early_init+0x62/0xd6
start_kernel+0x797/0x86a
secondary_startup_64+0xa4/0xb0
SOFTIRQ-ON-W at:
lock_acquire+0xda/0x3d0
_raw_spin_lock+0x2f/0x40
sync_global_pgds_l4+0x77/0x180
pcpu_alloc+0x1fd/0x7b0
__kmem_cache_create+0x358/0x540
create_cache+0xe1/0x1f0
kmem_cache_create_usercopy+0x1a5/0x270
kmem_cache_create+0x12/0x20
acpi_os_create_cache+0x18/0x30
acpi_ut_create_caches+0x47/0xab
acpi_ut_init_globals+0xa/0x21a
acpi_initialize_subsystem+0x30/0xa5
acpi_early_init+0x62/0xd6
start_kernel+0x797/0x86a
secondary_startup_64+0xa4/0xb0
INITIAL USE at:
}
... key at: [<ffffffffb96340b8>] pgd_lock+0x18/0x40
... acquired at:
_raw_spin_lock+0x2f/0x40
sync_global_pgds_l4+0x77/0x180
pcpu_alloc+0x1fd/0x7b0
fib_nh_common_init+0x53/0x110
fib6_nh_init+0x10c/0x700
ip6_route_info_create+0x344/0x440
ip6_route_add+0x18/0x90
addrconf_prefix_route.isra.48+0x17b/0x210
addrconf_notify+0x743/0x8c0
notifier_call_chain+0x47/0x70
__dev_notify_flags+0x9d/0x150
dev_change_flags+0x48/0x60
do_setlink+0x39d/0x1080
rtnl_setlink+0x116/0x190
rtnetlink_rcv_msg+0x188/0x4b0
netlink_rcv_skb+0x75/0x140
netlink_unicast+0x1ae/0x280
netlink_sendmsg+0x253/0x490
sock_sendmsg+0x5b/0x60
__sys_sendto+0x12c/0x190
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x60/0x230
entry_SYSCALL_64_after_hwframe+0x49/0xb3

-> (&ndev->lock){++.-}-{2:2} {
HARDIRQ-ON-W at:
lock_acquire+0xda/0x3d0
_raw_write_lock_bh+0x34/0x40
ipv6_mc_init_dev+0x19/0xc0
ipv6_add_dev+0x2e5/0x490
addrconf_init+0x7f/0x250
inet6_init+0x1c3/0x373
do_one_initcall+0x70/0x340
kernel_init_freeable+0x249/0x2ca
kernel_init+0xa/0x10a
ret_from_fork+0x3a/0x50
HARDIRQ-ON-R at:
lock_acquire+0xda/0x3d0
_raw_read_lock_bh+0x37/0x50
addrconf_dad_work+0xc6/0x560
process_one_work+0x25e/0x5c0
worker_thread+0x30/0x380
kthread+0x139/0x160
ret_from_fork+0x3a/0x50
IN-SOFTIRQ-R at:
lock_acquire+0xda/0x3d0
_raw_read_lock_bh+0x37/0x50
mld_ifc_timer_expire+0x3c/0x350
call_timer_fn+0xa5/0x2f0
run_timer_softirq+0x1dd/0x580
__do_softirq+0xf8/0x4be
irq_exit+0xf1/0x100
smp_apic_timer_interrupt+0xd0/0x2a0
apic_timer_interrupt+0xf/0x20
cpuidle_enter_state+0xcd/0x440
cpuidle_enter+0x29/0x40
do_idle+0x24a/0x290
cpu_startup_entry+0x19/0x20
start_secondary+0x195/0x1e0
secondary_startup_64+0xa4/0xb0
INITIAL USE at:
lock_acquire+0xda/0x3d0
_raw_write_lock_bh+0x34/0x40
ipv6_mc_init_dev+0x19/0xc0
ipv6_add_dev+0x2e5/0x490
addrconf_init+0x7f/0x250
inet6_init+0x1c3/0x373
do_one_initcall+0x70/0x340
kernel_init_freeable+0x249/0x2ca
kernel_init+0xa/0x10a
ret_from_fork+0x3a/0x50
}
... key at: [<ffffffffbaf727f0>] __key.78650+0x0/0x10
... acquired at:
mark_lock+0x22e/0x740
__lock_acquire+0x9e1/0x1c30
lock_acquire+0xda/0x3d0
_raw_read_lock_bh+0x37/0x50
mld_ifc_timer_expire+0x3c/0x350
call_timer_fn+0xa5/0x2f0
run_timer_softirq+0x1dd/0x580
__do_softirq+0xf8/0x4be
irq_exit+0xf1/0x100
smp_apic_timer_interrupt+0xd0/0x2a0
apic_timer_interrupt+0xf/0x20
cpuidle_enter_state+0xcd/0x440
cpuidle_enter+0x29/0x40
do_idle+0x24a/0x290
cpu_startup_entry+0x19/0x20
start_secondary+0x195/0x1e0
secondary_startup_64+0xa4/0xb0


stack backtrace:
CPU: 4 PID: 0 Comm: swapper/4 Not tainted 5.7.0-rc3-test+ #249
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
Call Trace:
<IRQ>
dump_stack+0x8f/0xd0
check_usage_forwards.cold.61+0x1e/0x27
mark_lock+0x22e/0x740
? check_usage_backwards+0x1e0/0x1e0
__lock_acquire+0x9e1/0x1c30
lock_acquire+0xda/0x3d0
? mld_ifc_timer_expire+0x3c/0x350
? mld_dad_timer_expire+0xb0/0xb0
? mld_dad_timer_expire+0xb0/0xb0
_raw_read_lock_bh+0x37/0x50
? mld_ifc_timer_expire+0x3c/0x350
mld_ifc_timer_expire+0x3c/0x350
? mld_dad_timer_expire+0xb0/0xb0
? mld_dad_timer_expire+0xb0/0xb0
call_timer_fn+0xa5/0x2f0
? mld_dad_timer_expire+0xb0/0xb0
run_timer_softirq+0x1dd/0x580
__do_softirq+0xf8/0x4be
irq_exit+0xf1/0x100
smp_apic_timer_interrupt+0xd0/0x2a0
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:cpuidle_enter_state+0xcd/0x440
Code: 80 7c 24 13 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 0c 03 00 00 31 ff e8 6f 35 8b ff e8 1a 52 92 ff fb 66 0f 1f 44 00 00 <85> ed 0f 88 74 02 00 00 48 63 c5 4c 8b 3c 24 4c 2b 7c 24 08 48 8d
RSP: 0018:ffff9a581981fe70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: 0000000000e2cf41 RBX: ffff9a581ab37400 RCX: 0000000000000000
RDX: ffff9a581982d100 RSI: 0000000000000006 RDI: ffff9a581982d100
RBP: 0000000000000004 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffb96f14c0
R13: ffffffffb96f1678 R14: 0000000000000004 R15: 0000000000000004
cpuidle_enter+0x29/0x40
do_idle+0x24a/0x290
cpu_startup_entry+0x19/0x20
start_secondary+0x195/0x1e0
secondary_startup_64+0xa4/0xb0


-- Steve

2020-05-04 18:44:26

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH] percpu: Sync vmalloc mappings in pcpu_alloc() and free_percpu()

On Mon, May 04, 2020 at 11:38:43AM -0400, Mathieu Desnoyers wrote:
> ----- On May 4, 2020, at 11:31 AM, Joerg Roedel [email protected] wrote:
> > Tried this before, actually I put it into the caller of
> > pcpu_mem_zalloc(), but that didn't fix the problem for me. Stevens
> > test-case still hangs the machine.
>
> That's unexpected.
>
> Did you confirm that those hangs were also caused by percpu allocations?

No, I didn't dig deeper on that hang. I can probably do that tomorrow.


Regards,

Joerg

2020-05-04 18:50:15

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH] percpu: Sync vmalloc mappings in pcpu_alloc() and free_percpu()

On Mon, May 04, 2020 at 01:40:42PM -0400, Steven Rostedt wrote:
> Seems that your patch caused a lockdep splat on my box:
>
> ========================================================
> WARNING: possible irq lock inversion dependency detected
> 5.7.0-rc3-test+ #249 Not tainted
> --------------------------------------------------------
> swapper/4/0 just changed the state of lock:
> ffff9a580fdd75a0 (&ndev->lock){++.-}-{2:2}, at: mld_ifc_timer_expire+0x3c/0x350
> but this lock took another, SOFTIRQ-unsafe lock in the past:
> (pgd_lock){+.+.}-{2:2}
>
>
> and interrupts could create inverse lock ordering between them.
>
>
> other info that might help us debug this:
> Possible interrupt unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(pgd_lock);
>                                local_irq_disable();
>                                lock(&ndev->lock);
>                                lock(pgd_lock);
>   <Interrupt>
>     lock(&ndev->lock);
>
> *** DEADLOCK ***

Fair point, but this just shows how problematic it is to call something
like vmalloc_sync_mappings() from a lower-level kernel API function.
The obvious fix for this would be to make pgd_lock irq-safe, but this is
getting more and more ridiculous.

I know you don't like to have a vmalloc_sync_mappings() call in the
tracing code, but can you live with it until we get rid of this broken
interface?

My plan for this is to use a small bitmap to track in the vmalloc and
the (x86-)ioremap code at which levels of the page-tables the code made
changes and combine that with an architecture-dependent mask to decide
whether anything needs to be synced.

On x86-64 the sync would be necessary at most 64 times after boot, so I
think this will only have a very small performance impact, even with
VMAP_STACKS. And as a bonus it would also get rid of vmalloc faulting on
x86, fixing the issue with tracing too.
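
A hedged sketch of that plan; every identifier below is invented for
illustration (they roughly anticipate what was later merged upstream as the
PGTBL_*_MODIFIED bits and arch_sync_kernel_mappings()), not code that
existed at the time:

#include <linux/bits.h>

/* Which page-table levels a vmalloc/ioremap operation modified. */
#define PGTBL_PGD_MODIFIED      BIT(0)
#define PGTBL_P4D_MODIFIED      BIT(1)
#define PGTBL_PUD_MODIFIED      BIT(2)
#define PGTBL_PMD_MODIFIED      BIT(3)
#define PGTBL_PTE_MODIFIED      BIT(4)

/* Arch mask: which modified levels actually require a sync. */
#ifndef ARCH_PAGE_TABLE_SYNC_MASK
#define ARCH_PAGE_TABLE_SYNC_MASK       0
#endif

static void sync_if_needed(unsigned long start, unsigned long end,
                           unsigned int mask)
{
        /* Sync only when a level the architecture cares about changed. */
        if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
                arch_sync_kernel_mappings(start, end);
}

On x86-64 the mask would cover only the top level, so once the vmalloc
area's 64 top-level entries have all been instantiated the check never
fires again.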

Regards,

Joerg

2020-05-04 19:11:55

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] percpu: Sync vmalloc mappings in pcpu_alloc() and free_percpu()

On Mon, 4 May 2020 20:38:32 +0200
Joerg Roedel <[email protected]> wrote:

> Fair point, but this just shows how problematic it is to call something
> like vmalloc_sync_mappings() from a lower-level kernel API function.
> The obvious fix for this would be to make pgd_lock irq-safe, but this is
> getting more and more ridiculous.

Yeah, it's just going deeper and deeper into the rabbit hole.

>
> I know you don't like to have a vmalloc_sync_mappings() call in the
> tracing code, but can you live with it until we get rid of this broken
> interface?

I'm fine with adding it to the tracing code (with that ridiculous
comment! ;-)

I'll even tag it as stable, but again, it's uncertain what commit it
"fixes".

>
> My plan for this is to use a small bitmap to track in the vmalloc and
> the (x86-)ioremap code at which levels of the page-tables the code made
> changes and combine that with an architecture-dependend mask to decide
> whether anything needs to be synced.
>
> On x86-64 the sync would be necessary at most 64 times after boot, so I
> think this will only have a very small performance impact, even with
> VMAP_STACKS. And as a bonus it would also get rid of vmalloc faulting on
> x86, fixing the issue with tracing too.

Just because tracing gets a workaround, let's hopefully not get distracted
and forget to add this. It looks like the best solution is to not need the
vmalloc_sync_mappings() call in the future. But, as we are about to release
trace-cmd with a feature that reliably triggers the bug, I can't wait for
the proper fix.

-- Steve

2020-05-04 20:30:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] percpu: Sync vmalloc mappings in pcpu_alloc() and free_percpu()

On Mon, May 04, 2020 at 05:12:36PM +0200, Joerg Roedel wrote:
> On Thu, Apr 30, 2020 at 10:39:19PM -0400, Steven Rostedt wrote:
> > What's so damn special about alloc_percpu()? It's definitely not a fast
> > path. And it's not used often.
>
> Okay, I fixed it in the percpu code. It is definitely not a nice
> solution, but having to call vmalloc_sync_mappings/unmappings() is not a
> nice solution at any place in the code. Here is the patch which fixes
> this issue for me. I am also not sure what to put in the Fixes tag, as
> it is related to tracing code accessing per-cpu data from the page-fault
> handler, and I am not sure when this got introduced. Maybe someone else can
> provide a meaningful Fixes- or stable tag.
>
> I also have an idea in mind how to make this all more robust and get rid
> of the vmalloc_sync_mappings/unmappings() interface, will show more when
> I know it works the way I think it does.

I'm only following this with half an eye atm, but one solution is to
pull the vmalloc fault out of the trace path entirely.

With the current code that is 'interesting', but with the pile of
patches Thomas is sitting on that would be quite feasible.

2020-05-04 20:46:27

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] percpu: Sync vmalloc mappings in pcpu_alloc() and free_percpu()

On Mon, 4 May 2020 22:25:40 +0200
Peter Zijlstra <[email protected]> wrote:

> I'm only following this with half an eye atm, but one solution is to
> pull the vmalloc fault out of the trace path entirely.
>
> With the current code that is 'interesting', but with the pile of
> patches Thomas is sitting on that would be quite feasible.

As I assume that's not something that would be backported to the stable
kernels, I'm still going to add the workaround.

But getting rid of vmalloc_sync_mappings() altogether is still a good idea
IMO.

-- Steve

2020-05-05 12:33:39

by Joerg Roedel

[permalink] [raw]
Subject: [PATCH] tracing: Call vmalloc_sync_mappings() after alloc_percpu()

On Mon, May 04, 2020 at 03:10:06PM -0400, Steven Rostedt wrote:
> I'm fine with adding it to the tracing code (with that ridiculous
> comment! ;-)
>
> I'll even tag it as stable, but again, it's uncertain what commit it
> "fixes".

Okay, so here it is. It fixes the issue for me and doesn't cause any
lockdep splats on my machine. I am not sure how far this needs to be
backported via stable, so I didn't specify it. Same with a Fixes tag.
Oh, and I added your comment :-) (which makes up most of the added
lines, so feel free to change authorship to you).

Regards,

Joerg

From a265290761dc9d87d35892dc821fd0f5c99f8b34 Mon Sep 17 00:00:00 2001
From: Joerg Roedel <[email protected]>
Date: Tue, 5 May 2020 14:17:57 +0200
Subject: [PATCH] tracing: Call vmalloc_sync_mappings() after alloc_percpu()

Tracing code allocates percpu memory which is touched when trace events
happen. When the events happen in the page-fault handler and per-cpu
memory needs to be faulted in, the page-fault on per-cpu memory might
never complete.

Call vmalloc_sync_mappings() after allocating the runtime per-cpu
buffers to make sure they are mapped when the page-fault handler tries
to access them.

This is not a nice solution, but the best fix until the
vmalloc_sync_mappings/unmappings() interface can be ripped out of the
kernel.

Cc: [email protected]
Signed-off-by: Joerg Roedel <[email protected]>
---
kernel/trace/trace.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8d2b98812625..0fc56063fcca 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8525,6 +8525,19 @@ static int allocate_trace_buffers(struct trace_array *tr, int size)
*/
allocate_snapshot = false;
#endif
+
+ /*
+ * Because of some magic with the way alloc_percpu() works on
+ * x86_64, we need to synchronize the pgd of all the tables,
+ * otherwise the trace events that happen in x86_64 page fault
+ * handlers can't cope with accessing the chance that a
+ * alloc_percpu()'d memory might be touched in the page fault trace
+ * event. Oh, and we need to audit all alloc_percpu() and vmalloc()
+ * calls in tracing, because something might get triggered within a
+ * page fault trace event!
+ */
+ vmalloc_sync_mappings();
+
return 0;
}

--
2.12.3



2020-05-06 15:21:26

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] tracing: Call vmalloc_sync_mappings() after alloc_percpu()

On Tue, 5 May 2020 14:31:44 +0200
Joerg Roedel <[email protected]> wrote:

> On Mon, May 04, 2020 at 03:10:06PM -0400, Steven Rostedt wrote:
> > I'm fine with adding it to the tracing code (with that ridiculous
> > comment! ;-)
> >
> > I'll even tag it as stable, but again, it's uncertain what commit it
> > "fixes".
>
> Okay, so here it is. It fixes the issue for me and doesn't cause any
> lockdep splats on my machine. I am not sure how far this needs to be
> backported via stable, so I didn't specify it. Same with a Fixes tag.
> Oh, and I added your comment :-) (which makes up most of the added
> lines, so feel free to change authorship to you).

OK, I did take authorship. At least it will be me who gets the blame ;-)

Here's the patch:

From 685fc0b545f39bf9bfafa0396113c852ef3c81fc Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (VMware)" <[email protected]>
Date: Wed, 6 May 2020 10:36:18 -0400
Subject: [PATCH] tracing: Add a vmalloc_sync_mappings() for safe measure

x86_64 lazily maps in the vmalloc pages, and the way this works with per_cpu
areas can be complex, to say the least. Mappings may happen at boot up, and
if nothing synchronizes the page tables, those page mappings may not be
synced till they are used. This causes issues for anything that might touch
one of those mappings in the path of the page fault handler. When one of
those unmapped mappings is touched in the page fault handler, it will cause
another page fault, which in turn will cause yet another, leaving us in
a loop of page faults.

Commit 763802b53a42 ("x86/mm: split vmalloc_sync_all()") split
vmalloc_sync_all() into vmalloc_sync_unmappings() and
vmalloc_sync_mappings(), as on system exit, it did not need to do a full
sync on x86_64 (although it still needed to be done on x86_32). By chance,
the vmalloc_sync_all() would synchronize the page mappings done at boot up
and prevent the per cpu area from being a problem for tracing in the page
fault handler. But when that synchronization in the exit of a task became a
nop, it caused the problem to appear.

Link: https://lore.kernel.org/r/[email protected]

Cc: [email protected]
Fixes: 737223fbca3b1 ("tracing: Consolidate buffer allocation code")
Reported-by: "Tzvetomir Stoyanov (VMware)" <[email protected]>
Suggested-by: Joerg Roedel <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
---
kernel/trace/trace.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8d2b98812625..9ed6d92768af 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8525,6 +8525,19 @@ static int allocate_trace_buffers(struct trace_array *tr, int size)
*/
allocate_snapshot = false;
#endif
+
+ /*
+ * Because of some magic with the way alloc_percpu() works on
+ * x86_64, we need to synchronize the pgd of all the tables,
+ * otherwise the trace events that happen in x86_64 page fault
+ * handlers can't cope with accessing the chance that a
+ * alloc_percpu()'d memory might be touched in the page fault trace
+ * event. Oh, and we need to audit all other alloc_percpu() and vmalloc()
+ * calls in tracing, because something might get triggered within a
+ * page fault trace event!
+ */
+ vmalloc_sync_mappings();
+
return 0;
}

--
2.20.1

2020-05-08 14:46:21

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH] tracing: Call vmalloc_sync_mappings() after alloc_percpu()

On Wed, May 06, 2020 at 11:17:04AM -0400, Steven Rostedt wrote:
> OK, I did take authorship. At least it will be me who gets the blame ;-)

Okay, I have meanwhile worked on the removal of vmalloc_sync_(un)mappings.
Let's see what people think about it.

Regards,

Joerg