LinuxLists.cc - Re: hit a KASan bug related to Perf during stress test

2016-10-24 09:53:47

Subject: Re: hit a KASan bug related to Perf during stress test

On Mon, Oct 24, 2016 at 09:35:46AM +0000, Ni, BaoleX wrote:
>
> [32736.018823] BUG: KASan: use after free in task_tgid_nr_ns+0x35/0xb0 at addr ffff8800265568c0
> [32736.028309] Read of size 8 by task dumpsys/11268
> [32736.033511] =============================================================================
> [32736.042700] BUG task_struct (Tainted: G W O): kasan: bad access detected

'W' this wasn't the first WARN you got, this means this might be the
result of prior borkage.

Also, it says: "BUG task_struct", does that mean task_struct was the
object accessed after free?

> [32736.051002] -----------------------------------------------------------------------------
> [32736.051002]
> [32736.061840] Disabling lock debugging due to kernel taint
> [32736.067830] INFO: Slab 0xffffea0000995400 objects=5 used=3 fp=0xffff880026550000 flags=0x4000000000004080
> [32736.078572] INFO: Object 0xffff880026556440 @offset=25664 fp=0x (null)
> ...
> [32738.776936] CPU: 0 PID: 11268 Comm: dumpsys Tainted: G B W O 3.14.70-x86_64-02260-g162539f #1
> [32738.787092] Hardware name: Insyde CherryTrail/T3 MRD, BIOS CHTMRD.A6.002.016 09/20/2016
> [32738.796082] ffff880026550000 0000000000000086 0000000000000000 ffff880065e05a70
> [32738.796215] ffffffff81fc9427 ffff880065803b40 ffff880026556440 ffff880065e05aa0
> [32738.796345] ffffffff8123fe2d ffff880065803b40 ffffea0000995400 ffff880026556440
> [32738.796475] Call Trace:
> [32738.796510] <NMI>
> [32738.796585] [<ffffffff81fc9427>] dump_stack+0x67/0x90
> [32738.802404] [<ffffffff8123fe2d>] print_trailer+0xfd/0x170
> [32738.808603] [<ffffffff81244f26>] object_err+0x36/0x40
> [32738.814417] [<ffffffff812467ed>] kasan_report_error+0x1fd/0x3d0
> [32738.821193] [<ffffffff81131b84>] ? __rcu_read_unlock+0x24/0x90
> [32738.827881] [<ffffffff81fe0888>] ? preempt_count_sub+0x18/0xf0
> [32738.834565] [<ffffffff811db32c>] ? perf_output_put_handle+0x5c/0x170
> [32738.841833] [<ffffffff81246e70>] kasan_report+0x40/0x50
> [32738.847838] [<ffffffff810d9975>] ? task_tgid_nr_ns+0x35/0xb0
> [32738.854327] [<ffffffff81245d59>] __asan_load8+0x69/0xa0
> [32738.860333] [<ffffffff811dba18>] ? perf_output_copy+0x88/0x120
> [32738.867020] [<ffffffff810d9975>] task_tgid_nr_ns+0x35/0xb0

So here we did: perf_event_[pt]id(event, current);

How can _current_ not be valid anymore?

> [32738.873319] [<ffffffff811cd5d8>] __perf_event_header__init_id+0xb8/0x200
> [32738.880970] [<ffffffff811d6f19>] perf_prepare_sample+0xa9/0x4a0
> [32738.887754] [<ffffffff811d7700>] __perf_event_overflow+0x3f0/0x460
> [32738.894835] [<ffffffff81022998>] ? x86_perf_event_set_period+0x128/0x210
> [32738.902496] [<ffffffff811d8494>] perf_event_overflow+0x14/0x20
> [32738.909180] [<ffffffff8102cabc>] intel_pmu_handle_irq+0x25c/0x520
> [32738.916156] [<ffffffff81245945>] ? __asan_store8+0x15/0xa0
> [32738.922460] [<ffffffff81fddb8b>] perf_event_nmi_handler+0x2b/0x50
> [32738.929437] [<ffffffff81fdd4a8>] nmi_handle+0x88/0x230
> [32738.935346] [<ffffffff81009873>] do_nmi+0x193/0x490
> [32738.940963] [<ffffffff81fdc6d6>] end_repeat_nmi+0x1a/0x1e
> [32738.947163] [<ffffffff81245d22>] ? __asan_load8+0x32/0xa0
> [32738.953358] [<ffffffff81245d22>] ? __asan_load8+0x32/0xa0
> [32738.959554] [<ffffffff81245d22>] ? __asan_load8+0x32/0xa0
> [32738.965718] <<EOE>>
> [32738.965787] [<ffffffff811065a2>] ? check_preempt_wakeup+0x1a2/0x3a0
> [32738.972970] [<ffffffff810f4618>] check_preempt_curr+0xf8/0x120
> [32738.979658] [<ffffffff810f465d>] ttwu_do_wakeup+0x1d/0x1b0
> [32738.985953] [<ffffffff810f4909>] ttwu_do_activate.constprop.105+0x89/0x90
> [32738.993710] [<ffffffff810f87fe>] try_to_wake_up+0x29e/0x4e0
> [32739.000100] [<ffffffff810f8aaf>] default_wake_function+0x2f/0x40
> [32739.006979] [<ffffffff81114338>] autoremove_wake_function+0x18/0x50
> [32739.014149] [<ffffffff81fe0888>] ? preempt_count_sub+0x18/0xf0
> [32739.020836] [<ffffffff81113ab9>] __wake_up_common+0x79/0xb0
> [32739.027232] [<ffffffff81113d69>] __wake_up+0x39/0x50
> [32739.032945] [<ffffffff81135918>] __call_rcu_nocb_enqueue+0x158/0x160
> [32739.040207] [<ffffffff81135a4c>] __call_rcu+0x12c/0x450

And while we just called release_task(), that call_rcu() should still be
pending at this point, also I don't think that can be current until
after do_task_dead() where we schedule away from the dead task and
change current.

> [32739.046207] [<ffffffff81135dcd>] call_rcu+0x1d/0x20
> [32739.051821] [<ffffffff810ae2da>] release_task+0x6aa/0x8d0
> [32739.058022] [<ffffffff8111e86f>] ? do_raw_write_unlock+0x6f/0xd0
> [32739.064900] [<ffffffff810b1002>] do_exit+0xe52/0x1020
> [32739.070712] [<ffffffff810b1222>] SyS_exit+0x22/0x30
> [32739.076328] [<ffffffff81fe9063>] sysenter_dispatch+0x7/0x1f
> [32739.082725] [<ffffffff8152f33b>] ? trace_hardirqs_on_thunk+0x3a/0x3c

Oleg, any idea?

2016-10-24 11:17:00

by Oleg Nesterov

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On 10/24, Peter Zijlstra wrote:
>
> > [32738.867020] [<ffffffff810d9975>] task_tgid_nr_ns+0x35/0xb0
>
> So here we did: perf_event_[pt]id(event, current);
>
> How can _current_ not be valid anymore?

...

> > [32739.040207] [<ffffffff81135a4c>] __call_rcu+0x12c/0x450
>
> And while we just called release_task(), that call_rcu() should still be
> pending at this point,

Yes, current is still valid.

But nothing protects current->group_leader or parent/real_parent, they
can point to the exited/freed task. We really need to nullify them in
__unhash_process() to catch the problems like this, I wanted to do this
many times...

So you simply can't know your tgid or even tid after release_task() calls
__unhash_process(). Actually after exit_notify() unless the exiting task
autoreaps itself.

How about the trivial fix below?

Oleg.

--- x/kernel/events/core.c
+++ x/kernel/events/core.c
@@ -1257,7 +1257,7 @@ static u32 perf_event_pid(struct perf_ev
if (event->parent)
event = event->parent;

- return task_tgid_nr_ns(p, event->ns);
+ return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : 0;
}

static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)

2016-10-24 11:24:08

by Peter Zijlstra

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On Mon, Oct 24, 2016 at 01:15:27PM +0200, Oleg Nesterov wrote:
> On 10/24, Peter Zijlstra wrote:
> >
> > > [32738.867020] [<ffffffff810d9975>] task_tgid_nr_ns+0x35/0xb0
> >
> > So here we did: perf_event_[pt]id(event, current);
> >
> > How can _current_ not be valid anymore?
>
> ...
>
> > > [32739.040207] [<ffffffff81135a4c>] __call_rcu+0x12c/0x450
> >
> > And while we just called release_task(), that call_rcu() should still be
> > pending at this point,
>
> Yes, current is still valid.
>
> But nothing protects current->group_leader or parent/real_parent, they
> can point to the exited/freed task. We really need to nullify them in
> __unhash_process() to catch the problems like this, I wanted to do this
> many times...
>
> So you simply can't know your tgid or even tid after release_task() calls
> __unhash_process(). Actually after exit_notify() unless the exiting task
> autoreaps itself.
>
> How about the trivial fix below?
>
> Oleg.
>
> --- x/kernel/events/core.c
> +++ x/kernel/events/core.c
> @@ -1257,7 +1257,7 @@ static u32 perf_event_pid(struct perf_ev
> if (event->parent)
> event = event->parent;
>
> - return task_tgid_nr_ns(p, event->ns);
> + return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : 0;
> }

Hurm.. should we not push this into task_tgid_nr_ns() ? I mean, now the
user needs to be aware of this dinky detail.

2016-10-24 11:27:36

by Peter Zijlstra

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On Mon, Oct 24, 2016 at 01:15:27PM +0200, Oleg Nesterov wrote:
> How about the trivial fix below?
>
> Oleg.
>
> --- x/kernel/events/core.c
> +++ x/kernel/events/core.c
> @@ -1257,7 +1257,7 @@ static u32 perf_event_pid(struct perf_ev
> if (event->parent)
> event = event->parent;
>
> - return task_tgid_nr_ns(p, event->ns);
> + return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : 0;
> }

Also, now we get a (few) sample(s) with a different pid:tid than prior
samples and not matching the sched_switch() events.

I can imagine that being somewhat confusing for people/tools.

Acme/Jolsa, any idea if that will bugger perf-report?

2016-10-24 11:29:53

by Peter Zijlstra

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On Mon, Oct 24, 2016 at 01:27:32PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 24, 2016 at 01:15:27PM +0200, Oleg Nesterov wrote:
> > How about the trivial fix below?
> >
> > Oleg.
> >
> > --- x/kernel/events/core.c
> > +++ x/kernel/events/core.c
> > @@ -1257,7 +1257,7 @@ static u32 perf_event_pid(struct perf_ev
> > if (event->parent)
> > event = event->parent;
> >
> > - return task_tgid_nr_ns(p, event->ns);
> > + return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : 0;
> > }
>
> Also, now we get a (few) sample(s) with a different pid:tid than prior
> samples and not matching the sched_switch() events.
>
> I can imagine that being somewhat confusing for people/tools.
>
> Acme/Jolsa, any idea if that will bugger perf-report?

Hurm, then again, I imagine that after unhash_process the PID/TID could
be instantly re-used and then we're still confused.

Yuck.

2016-10-24 12:04:07

by Oleg Nesterov

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On 10/24, Peter Zijlstra wrote:
>
> On Mon, Oct 24, 2016 at 01:15:27PM +0200, Oleg Nesterov wrote:
> >
> > --- x/kernel/events/core.c
> > +++ x/kernel/events/core.c
> > @@ -1257,7 +1257,7 @@ static u32 perf_event_pid(struct perf_ev
> > if (event->parent)
> > event = event->parent;
> >
> > - return task_tgid_nr_ns(p, event->ns);
> > + return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : 0;
> > }
>
> Hurm.. should we not push this into task_tgid_nr_ns() ? I mean, now the
> user needs to be aware of this dinky detail.

Perhaps. Or into task_tgid(). Or even the patch below, __task_pid_nr_ns()
is always safe. This certainly needs some cleanups.

Oleg.

--- x/include/linux/pid.h
+++ x/include/linux/pid.h
@@ -8,7 +8,8 @@ enum pid_type
PIDTYPE_PID,
PIDTYPE_PGID,
PIDTYPE_SID,
- PIDTYPE_MAX
+ PIDTYPE_MAX,
+ PIDTYPE_TGID /* do not use */
};

/*
--- x/kernel/pid.c
+++ x/kernel/pid.c
@@ -538,7 +538,7 @@ EXPORT_SYMBOL(__task_pid_nr_ns);

pid_t task_tgid_nr_ns(struct task_struct *tsk, struct pid_namespace *ns)
{
- return pid_nr_ns(task_tgid(tsk), ns);
+ return __task_pid_nr_ns(tsk, PIDTYPE_TGID, ns);
}
EXPORT_SYMBOL(task_tgid_nr_ns);

2016-10-24 12:04:17

by Jiri Olsa

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On Mon, Oct 24, 2016 at 01:29:45PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 24, 2016 at 01:27:32PM +0200, Peter Zijlstra wrote:
> > On Mon, Oct 24, 2016 at 01:15:27PM +0200, Oleg Nesterov wrote:
> > > How about the trivial fix below?
> > >
> > > Oleg.
> > >
> > > --- x/kernel/events/core.c
> > > +++ x/kernel/events/core.c
> > > @@ -1257,7 +1257,7 @@ static u32 perf_event_pid(struct perf_ev
> > > if (event->parent)
> > > event = event->parent;
> > >
> > > - return task_tgid_nr_ns(p, event->ns);
> > > + return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : 0;
> > > }
> >
> > Also, now we get a (few) sample(s) with a different pid:tid than prior
> > samples and not matching the sched_switch() events.
> >
> > I can imagine that being somewhat confusing for people/tools.
> >
> > Acme/Jolsa, any idea if that will bugger perf-report?
>
> Hurm, then again, I imagine that after unhash_process the PID/TID could
> be instantly re-used and then we're still confused.

sounds bad.. I haven't checked the related pid_alive code,
but shouldn't we already get the EXIT event in this case?

jirka

2016-10-24 12:12:04

by Oleg Nesterov

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On 10/24, Oleg Nesterov wrote:
>
> On 10/24, Peter Zijlstra wrote:
> >
> > On Mon, Oct 24, 2016 at 01:15:27PM +0200, Oleg Nesterov wrote:
> > >
> > > --- x/kernel/events/core.c
> > > +++ x/kernel/events/core.c
> > > @@ -1257,7 +1257,7 @@ static u32 perf_event_pid(struct perf_ev
> > > if (event->parent)
> > > event = event->parent;
> > >
> > > - return task_tgid_nr_ns(p, event->ns);
> > > + return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : 0;
> > > }
> >
> > Hurm.. should we not push this into task_tgid_nr_ns() ? I mean, now the
> > user needs to be aware of this dinky detail.
>
> Perhaps. Or into task_tgid(). Or even the patch below, __task_pid_nr_ns()
> is always safe. This certainly needs some cleanups.

the patch was obviously incomplete.

Oleg.

--- x/include/linux/pid.h
+++ x/include/linux/pid.h
@@ -8,7 +8,8 @@ enum pid_type
PIDTYPE_PID,
PIDTYPE_PGID,
PIDTYPE_SID,
- PIDTYPE_MAX
+ PIDTYPE_MAX,
+ PIDTYPE_TGID /* do not use */
};

/*
--- x/kernel/pid.c
+++ x/kernel/pid.c
@@ -526,8 +526,11 @@ pid_t __task_pid_nr_ns(struct task_struc
if (!ns)
ns = task_active_pid_ns(current);
if (likely(pid_alive(task))) {
- if (type != PIDTYPE_PID)
+ if (type != PIDTYPE_PID) {
+ if (type == PIDTYPE_TGID)
+ type = PIDTYPE_PID;
task = task->group_leader;
+ }
nr = pid_nr_ns(rcu_dereference(task->pids[type].pid), ns);
}
rcu_read_unlock();
@@ -538,7 +541,7 @@ EXPORT_SYMBOL(__task_pid_nr_ns);

pid_t task_tgid_nr_ns(struct task_struct *tsk, struct pid_namespace *ns)
{
- return pid_nr_ns(task_tgid(tsk), ns);
+ return __task_pid_nr_ns(tsk, PIDTYPE_TGID, ns);
}
EXPORT_SYMBOL(task_tgid_nr_ns);

2016-10-24 12:12:10

by Peter Zijlstra

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On Mon, Oct 24, 2016 at 02:04:11PM +0200, Jiri Olsa wrote:
> On Mon, Oct 24, 2016 at 01:29:45PM +0200, Peter Zijlstra wrote:

> > Hurm, then again, I imagine that after unhash_process the PID/TID could
> > be instantly re-used and then we're still confused.
>
> sounds bad.. I haven't checked the related pid_alive code,
> but shouldn't we already get the EXIT event in this case?

It has, perf_event_exit_task() happens before we unhash.

But a per-cpu event that has PID/TID reporting on will run into this.

We'll observe 'funny' values between the unhash and the next context
switch.

2016-10-24 12:12:12

by Peter Zijlstra

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On Mon, Oct 24, 2016 at 01:15:27PM +0200, Oleg Nesterov wrote:
> Yes, current is still valid.
>
> But nothing protects current->group_leader or parent/real_parent, they
> can point to the exited/freed task. We really need to nullify them in
> __unhash_process() to catch the problems like this, I wanted to do this
> many times...
>
> So you simply can't know your tgid or even tid after release_task() calls
> __unhash_process(). Actually after exit_notify() unless the exiting task
> autoreaps itself.
>
> How about the trivial fix below?
>
> Oleg.
>
> --- x/kernel/events/core.c
> +++ x/kernel/events/core.c
> @@ -1257,7 +1257,7 @@ static u32 perf_event_pid(struct perf_ev
> if (event->parent)
> event = event->parent;
>
> - return task_tgid_nr_ns(p, event->ns);
> + return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : 0;
> }
>
> static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)

Should we do the same for perf_event_tid() and report -1 as the pid/tid
in the !alive case? -1 should be an obvious invalid pid since we limit
the pid-space to less than 32 bits.

2016-10-24 12:19:09

by Peter Zijlstra

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On Mon, Oct 24, 2016 at 02:02:32PM +0200, Oleg Nesterov wrote:
> Perhaps. Or into task_tgid(). Or even the patch below, __task_pid_nr_ns()
> is always safe. This certainly needs some cleanups.

> --- x/include/linux/pid.h
> +++ x/include/linux/pid.h
> @@ -8,7 +8,8 @@ enum pid_type
> PIDTYPE_PID,
> PIDTYPE_PGID,
> PIDTYPE_SID,
> - PIDTYPE_MAX
> + PIDTYPE_MAX,
> + PIDTYPE_TGID /* do not use */
> };
>
> /*
> --- x/kernel/pid.c
> +++ x/kernel/pid.c
> @@ -538,7 +538,7 @@ EXPORT_SYMBOL(__task_pid_nr_ns);
>
> pid_t task_tgid_nr_ns(struct task_struct *tsk, struct pid_namespace *ns)
> {
> - return pid_nr_ns(task_tgid(tsk), ns);
> + return __task_pid_nr_ns(tsk, PIDTYPE_TGID, ns);
> }
> EXPORT_SYMBOL(task_tgid_nr_ns);
>
>

Right, that will return 0 on !alive. But I'm not seeing how PIDTYPE_TGID
isn't an array bound violating of its own though. Then again, I didn't
look to hard at the pid stuff.

2016-10-24 12:22:12

by Peter Zijlstra

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On Mon, Oct 24, 2016 at 02:10:31PM +0200, Oleg Nesterov wrote:
> --- x/kernel/pid.c
> +++ x/kernel/pid.c
> @@ -526,8 +526,11 @@ pid_t __task_pid_nr_ns(struct task_struc
> if (!ns)
> ns = task_active_pid_ns(current);
> if (likely(pid_alive(task))) {
> - if (type != PIDTYPE_PID)
> + if (type != PIDTYPE_PID) {
> + if (type == PIDTYPE_TGID)
> + type = PIDTYPE_PID;
> task = task->group_leader;
> + }

Aah, that makes much more sense ;-)

> nr = pid_nr_ns(rcu_dereference(task->pids[type].pid), ns);
> }
> rcu_read_unlock();

Still, I wonder if returning 0 is the right thing. 0 is a 'valid' PID
for the init/idle task.

And we still have the re-use issue for the TID, because when we get here
TID is already unhashed too afaict, it just doesn't explode because we
don't deref freed memory.

2016-10-24 12:22:57

by Oleg Nesterov

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On 10/24, Peter Zijlstra wrote:
>
> On Mon, Oct 24, 2016 at 01:15:27PM +0200, Oleg Nesterov wrote:
> >
> > --- x/kernel/events/core.c
> > +++ x/kernel/events/core.c
> > @@ -1257,7 +1257,7 @@ static u32 perf_event_pid(struct perf_ev
> > if (event->parent)
> > event = event->parent;
> >
> > - return task_tgid_nr_ns(p, event->ns);
> > + return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : 0;
> > }
> >
> > static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
>
> Should we do the same for perf_event_tid() and report -1 as the pid/tid
> in the !alive case? -1 should be an obvious invalid pid since we limit
> the pid-space to less than 32 bits.

task_pid_nr_ns() is always safe, it calls __task_pid_nr_ns(). But yes,
it can return zero if called after exit_notify() and/or release_task().

And while zero is not a valid pid too, I guess it can be confused with
the idle thread's "pid" ?

Oleg.

2016-10-24 12:27:50

by Peter Zijlstra

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On Mon, Oct 24, 2016 at 02:21:23PM +0200, Oleg Nesterov wrote:
> > Should we do the same for perf_event_tid() and report -1 as the pid/tid
> > in the !alive case? -1 should be an obvious invalid pid since we limit
> > the pid-space to less than 32 bits.
>
> task_pid_nr_ns() is always safe, it calls __task_pid_nr_ns(). But yes,
> it can return zero if called after exit_notify() and/or release_task().
>
> And while zero is not a valid pid too, I guess it can be confused with
> the idle thread's "pid" ?

Right, 0 is the idle thread.

2016-10-24 12:31:20

by Oleg Nesterov

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On 10/24, Peter Zijlstra wrote:
>
> On Mon, Oct 24, 2016 at 02:10:31PM +0200, Oleg Nesterov wrote:
> > --- x/kernel/pid.c
> > +++ x/kernel/pid.c
> > @@ -526,8 +526,11 @@ pid_t __task_pid_nr_ns(struct task_struc
> > if (!ns)
> > ns = task_active_pid_ns(current);
> > if (likely(pid_alive(task))) {
> > - if (type != PIDTYPE_PID)
> > + if (type != PIDTYPE_PID) {
> > + if (type == PIDTYPE_TGID)
> > + type = PIDTYPE_PID;
> > task = task->group_leader;
> > + }
>
> Aah, that makes much more sense ;-)
>
> > nr = pid_nr_ns(rcu_dereference(task->pids[type].pid), ns);
> > }
> > rcu_read_unlock();
>
>
> Still, I wonder if returning 0 is the right thing. 0 is a 'valid' PID
> for the init/idle task.

Yes, now I think that -1 would make more sense. Unfortunately we can't
just change __task_pid_nr_ns(), it already has the users which assume
it returns zero... attach_to_pi_state() for example.

> And we still have the re-use issue for the TID, because when we get here
> TID is already unhashed too afaict,

Yes, so perf_event_tid() will report zero.

Oleg.

2016-10-24 12:38:19

by Peter Zijlstra

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On Mon, Oct 24, 2016 at 02:29:42PM +0200, Oleg Nesterov wrote:
> On 10/24, Peter Zijlstra wrote:
> >
> > On Mon, Oct 24, 2016 at 02:10:31PM +0200, Oleg Nesterov wrote:
> > > --- x/kernel/pid.c
> > > +++ x/kernel/pid.c
> > > @@ -526,8 +526,11 @@ pid_t __task_pid_nr_ns(struct task_struc
> > > if (!ns)
> > > ns = task_active_pid_ns(current);
> > > if (likely(pid_alive(task))) {
> > > - if (type != PIDTYPE_PID)
> > > + if (type != PIDTYPE_PID) {
> > > + if (type == PIDTYPE_TGID)
> > > + type = PIDTYPE_PID;
> > > task = task->group_leader;
> > > + }
> >
> > Aah, that makes much more sense ;-)
> >
> > > nr = pid_nr_ns(rcu_dereference(task->pids[type].pid), ns);
> > > }
> > > rcu_read_unlock();
> >
> >
> > Still, I wonder if returning 0 is the right thing. 0 is a 'valid' PID
> > for the init/idle task.
>
> Yes, now I think that -1 would make more sense. Unfortunately we can't
> just change __task_pid_nr_ns(), it already has the users which assume
> it returns zero... attach_to_pi_state() for example.

Indeed. And I have a patch that assumes task_pid_vnr(&init_task) == 0,
is that true because of this !alive case or true in general?

No worries though, we can revert to your earlier explicit test and
return -1 while adding a comment to explain details? I'll go write one
up in a bit, but I need to run an errand first.

> > And we still have the re-use issue for the TID, because when we get here
> > TID is already unhashed too afaict,
>
> Yes, so perf_event_tid() will report zero.

Ah, ok. So whould we change that to match pid and return (explicit) -1
there too?

2016-10-24 13:27:30

by Oleg Nesterov

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On 10/24, Peter Zijlstra wrote:
>
> On Mon, Oct 24, 2016 at 02:29:42PM +0200, Oleg Nesterov wrote:
> > On 10/24, Peter Zijlstra wrote:
> > >
> > > Still, I wonder if returning 0 is the right thing. 0 is a 'valid' PID
> > > for the init/idle task.
> >
> > Yes, now I think that -1 would make more sense. Unfortunately we can't
> > just change __task_pid_nr_ns(), it already has the users which assume
> > it returns zero... attach_to_pi_state() for example.
>
> Indeed. And I have a patch that assumes task_pid_vnr(&init_task) == 0,
> is that true because of this !alive case or true in general?

This is true in general. Idle threads are always alive but they use the
the special init_struct_pid with .nr == 0.

> No worries though, we can revert to your earlier explicit test and
> return -1 while adding a comment to explain details?

...

> Ah, ok. So whould we change that to match pid and return (explicit) -1
> there too?

Well, if we add that PIDTYPE_TGID hack, I think we can do something
like below...

Or do you think we should add a perf_alive() check into perf_event_pid()
for a quick fix?

Either way it's a pity we can't report at least the valid tid, perhaps
perf_event_tid() could use task_pid_nr() if event->ns == init_pid_ns,
I dunno.

Oleg.

--- x/kernel/events/core.c
+++ x/kernel/events/core.c
@@ -1249,26 +1249,30 @@ unclone_ctx(struct perf_event_context *c
return parent_ctx;
}

-static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
+static u32 perf_event_xxx(struct perf_event *event, struct task_struct *p,
+ enum pid_type type)
{
+ pid_t nr;
/*
* only top level events have the pid namespace they were created in
*/
if (event->parent)
event = event->parent;

- return task_tgid_nr_ns(p, event->ns);
+ nr = __task_pid_nr_ns(p, type, event->ns);
+ if (!nr && !is_idle_task(p))
+ nr = -1;
+ return nr;
}

-static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
+static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
{
- /*
- * only top level events have the pid namespace they were created in
- */
- if (event->parent)
- event = event->parent;
+ return perf_event_xxx(p, event, PIDTYPE_TGID);
+}

- return task_pid_nr_ns(p, event->ns);
+static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
+{
+ return perf_event_xxx(p, event, PIDTYPE_PID);
}

/*

2016-10-24 13:41:49

by Oleg Nesterov

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On 10/24, Oleg Nesterov wrote:
>
> -static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
> +static u32 perf_event_xxx(struct perf_event *event, struct task_struct *p,
> + enum pid_type type)
> {
> + pid_t nr;
> /*
> * only top level events have the pid namespace they were created in
> */
> if (event->parent)
> event = event->parent;
>
> - return task_tgid_nr_ns(p, event->ns);
> + nr = __task_pid_nr_ns(p, type, event->ns);
> + if (!nr && !is_idle_task(p))
> + nr = -1;
> + return nr;

And just in case... In any case __task_pid_nr_ns() and other similar helpers
can also return zero if "p" runs in another namespace. Say, in the parent ns.

Say, perf_event_switch_output(). What do we want to report in this case, zero
or -1 ?

Oleg.

2016-10-24 14:18:04

by Peter Zijlstra

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On Mon, Oct 24, 2016 at 03:40:13PM +0200, Oleg Nesterov wrote:
> On 10/24, Oleg Nesterov wrote:
> >
> > -static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
> > +static u32 perf_event_xxx(struct perf_event *event, struct task_struct *p,
> > + enum pid_type type)
> > {
> > + pid_t nr;
> > /*
> > * only top level events have the pid namespace they were created in
> > */
> > if (event->parent)
> > event = event->parent;
> >
> > - return task_tgid_nr_ns(p, event->ns);
> > + nr = __task_pid_nr_ns(p, type, event->ns);
> > + if (!nr && !is_idle_task(p))
> > + nr = -1;
> > + return nr;
>
> And just in case... In any case __task_pid_nr_ns() and other similar helpers
> can also return zero if "p" runs in another namespace. Say, in the parent ns.

Right, I'm tempted to not change that. Its been the behaviour for a
while and changing that will upset people.

The unhash case is different in that its actively broken so we must do
something.

> Say, perf_event_switch_output(). What do we want to report in this case, zero
> or -1 ?

As with all switch_output() cases, the user had better know wth he's
doing ;-) Doing a switch_output() on a running counter is dubious to
begin with.

2016-10-24 14:36:53

by Peter Zijlstra

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On Mon, Oct 24, 2016 at 03:25:55PM +0200, Oleg Nesterov wrote:
> Well, if we add that PIDTYPE_TGID hack, I think we can do something
> like below...
>
> Or do you think we should add a perf_alive() check into perf_event_pid()
> for a quick fix?

That is what I was thinking. Then we don't need to do the TGID hack,
I suspect some people might object to that.

> Either way it's a pity we can't report at least the valid tid, perhaps
> perf_event_tid() could use task_pid_nr() if event->ns == init_pid_ns,
> I dunno.

Right, but after unhash is there really still the notion of a valid TID?
I mean, the TID can be reused, at which point you'll end up with two
tasks etc..

But yes, very tedious.

I was thinking something like so?

---

kernel/events/core.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index c6e47e97b33f..2c9a22485e9e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1257,7 +1257,14 @@ static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
if (event->parent)
event = event->parent;

- return task_tgid_nr_ns(p, event->ns);
+ /*
+ * It is possible the task already got unhashed, in which case we
+ * cannot determine the current->group_leader/real_parent.
+ *
+ * Also, report -1 to indicate unhashed, so as not to confused with
+ * 0 for the idle task.
+ */
+ return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : ~0;
}

static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
@@ -1268,7 +1275,7 @@ static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
if (event->parent)
event = event->parent;

- return task_pid_nr_ns(p, event->ns);
+ return pid_alive(p) ? task_pid_nr_ns(p, event->ns) : ~0;
}

/*

2016-10-24 15:40:44

by Oleg Nesterov

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On 10/24, Peter Zijlstra wrote:
>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1257,7 +1257,14 @@ static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
> if (event->parent)
> event = event->parent;
>
> - return task_tgid_nr_ns(p, event->ns);
> + /*
> + * It is possible the task already got unhashed, in which case we
> + * cannot determine the current->group_leader/real_parent.
> + *
> + * Also, report -1 to indicate unhashed, so as not to confused with
> + * 0 for the idle task.
> + */
> + return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : ~0;
> }

Yes, but this _looks_ racy unless p == current. I mean, pid_alive() makes
task_tgid_nr_ns() safe, but task_tgid_nr_ns() still can return zero _if_
it can race with the exiting task.

> static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
> @@ -1268,7 +1275,7 @@ static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
> if (event->parent)
> event = event->parent;
>
> - return task_pid_nr_ns(p, event->ns);
> + return pid_alive(p) ? task_pid_nr_ns(p, event->ns) : ~0;

The same.

However. At first glance the only case when p != current is copy_process(),
right? And in this case the new child can't go away. So I think this patch
is fine.

Oleg.

2016-10-24 15:54:42

by Oleg Nesterov

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On 10/24, Oleg Nesterov wrote:
>
> On 10/24, Peter Zijlstra wrote:
> >
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -1257,7 +1257,14 @@ static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
> > if (event->parent)
> > event = event->parent;
> >
> > - return task_tgid_nr_ns(p, event->ns);
> > + /*
> > + * It is possible the task already got unhashed, in which case we
> > + * cannot determine the current->group_leader/real_parent.
> > + *
> > + * Also, report -1 to indicate unhashed, so as not to confused with
> > + * 0 for the idle task.
> > + */
> > + return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : ~0;
> > }
>
> Yes, but this _looks_ racy unless p == current. I mean, pid_alive() makes
> task_tgid_nr_ns() safe, but task_tgid_nr_ns() still can return zero _if_
> it can race with the exiting task.
>
> > static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
> > @@ -1268,7 +1275,7 @@ static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
> > if (event->parent)
> > event = event->parent;
> >
> > - return task_pid_nr_ns(p, event->ns);
> > + return pid_alive(p) ? task_pid_nr_ns(p, event->ns) : ~0;
>
> The same.
>
> However. At first glance the only case when p != current is copy_process(),
> right? And in this case the new child can't go away. So I think this patch
> is fine.

Actually there is another case, comm_write() -> perf_event_comm_output(). It
checks same_thread_group(current, p), so we can only race with the exiting
sub-thread. perf_event_pid() can't return zero, perf_event_tid() can.

And I personally think we do not care and your patch is fine anyway ;)

Oleg.

2016-10-25 06:56:13

by Baole Ni

[permalink] [raw]

Subject: RE: hit a KASan bug related to Perf during stress test

Thanks a lot, guys.
I will take Peter's patch to do stress test.

-----Original Message-----
From: Oleg Nesterov [mailto:[email protected]]
Sent: Monday, October 24, 2016 11:53 PM
To: Peter Zijlstra
Cc: Ni, BaoleX; [email protected]; [email protected]; [email protected]; [email protected]; Liu, Chuansheng
Subject: Re: hit a KASan bug related to Perf during stress test

On 10/24, Oleg Nesterov wrote:
>
> On 10/24, Peter Zijlstra wrote:
> >
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -1257,7 +1257,14 @@ static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
> > if (event->parent)
> > event = event->parent;
> >
> > - return task_tgid_nr_ns(p, event->ns);
> > + /*
> > + * It is possible the task already got unhashed, in which case we
> > + * cannot determine the current->group_leader/real_parent.
> > + *
> > + * Also, report -1 to indicate unhashed, so as not to confused with
> > + * 0 for the idle task.
> > + */
> > + return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : ~0;
> > }
>
> Yes, but this _looks_ racy unless p == current. I mean, pid_alive()
> makes
> task_tgid_nr_ns() safe, but task_tgid_nr_ns() still can return zero
> _if_ it can race with the exiting task.
>
> > static u32 perf_event_tid(struct perf_event *event, struct
> > task_struct *p) @@ -1268,7 +1275,7 @@ static u32 perf_event_tid(struct perf_event *event, struct task_struct *p)
> > if (event->parent)
> > event = event->parent;
> >
> > - return task_pid_nr_ns(p, event->ns);
> > + return pid_alive(p) ? task_pid_nr_ns(p, event->ns) : ~0;
>
> The same.
>
> However. At first glance the only case when p != current is
> copy_process(), right? And in this case the new child can't go away.
> So I think this patch is fine.

Actually there is another case, comm_write() -> perf_event_comm_output(). It checks same_thread_group(current, p), so we can only race with the exiting sub-thread. perf_event_pid() can't return zero, perf_event_tid() can.

And I personally think we do not care and your patch is fine anyway ;)

Oleg.

2016-10-25 09:28:34

by Peter Zijlstra

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On Mon, Oct 24, 2016 at 05:39:08PM +0200, Oleg Nesterov wrote:
> On 10/24, Peter Zijlstra wrote:
> >
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -1257,7 +1257,14 @@ static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
> > if (event->parent)
> > event = event->parent;
> >
> > - return task_tgid_nr_ns(p, event->ns);
> > + /*
> > + * It is possible the task already got unhashed, in which case we
> > + * cannot determine the current->group_leader/real_parent.
> > + *
> > + * Also, report -1 to indicate unhashed, so as not to confused with
> > + * 0 for the idle task.
> > + */
> > + return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : ~0;
> > }
>
> Yes, but this _looks_ racy unless p == current. I mean, pid_alive() makes
> task_tgid_nr_ns() safe, but task_tgid_nr_ns() still can return zero _if_
> it can race with the exiting task.

So what serialization would close that race? __task_pid_nr_ns() only
seems to use RCU nothing more.

2016-10-25 14:43:03

by Oleg Nesterov

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On 10/25, Peter Zijlstra wrote:
>
> On Mon, Oct 24, 2016 at 05:39:08PM +0200, Oleg Nesterov wrote:
> > On 10/24, Peter Zijlstra wrote:
> > >
> > > --- a/kernel/events/core.c
> > > +++ b/kernel/events/core.c
> > > @@ -1257,7 +1257,14 @@ static u32 perf_event_pid(struct perf_event *event, struct task_struct *p)
> > > if (event->parent)
> > > event = event->parent;
> > >
> > > - return task_tgid_nr_ns(p, event->ns);
> > > + /*
> > > + * It is possible the task already got unhashed, in which case we
> > > + * cannot determine the current->group_leader/real_parent.
> > > + *
> > > + * Also, report -1 to indicate unhashed, so as not to confused with
> > > + * 0 for the idle task.
> > > + */
> > > + return pid_alive(p) ? task_tgid_nr_ns(p, event->ns) : ~0;
> > > }
> >
> > Yes, but this _looks_ racy unless p == current. I mean, pid_alive() makes
> > task_tgid_nr_ns() safe, but task_tgid_nr_ns() still can return zero _if_
> > it can race with the exiting task.
>
> So what serialization would close that race? __task_pid_nr_ns() only
> seems to use RCU nothing more.

I do not see how can we close this race, we obviously do not want to use
any locking.

That is why I tried to suggest

nr = __task_pid_nr_ns(p, type, event->ns);
if (!nr && !is_idle_task(p))
nr = -1;
return nr;

but this will report -1 if p runs in another namespace, so perhaps we
can do

nr = __task_pid_nr_ns(p, type, event->ns);
if (!nr && p->exit_state)
// it has already called exit_notify
nr = -1;
return nr;

Oleg.

2016-10-26 09:03:40

by Peter Zijlstra

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On Tue, Oct 25, 2016 at 04:41:26PM +0200, Oleg Nesterov wrote:
> >
> > So what serialization would close that race? __task_pid_nr_ns() only
> > seems to use RCU nothing more.
>
> I do not see how can we close this race, we obviously do not want to use
> any locking.
>
> That is why I tried to suggest
>
> nr = __task_pid_nr_ns(p, type, event->ns);
> if (!nr && !is_idle_task(p))
> nr = -1;
> return nr;
>
> but this will report -1 if p runs in another namespace, so perhaps we
> can do
>
> nr = __task_pid_nr_ns(p, type, event->ns);
> if (!nr && p->exit_state)
> // it has already called exit_notify
> nr = -1;
> return nr;

I think I'm asking how __task_pid_nr_ns() isn't susceptible to this race
;-)

2016-10-26 16:12:15

by Oleg Nesterov

[permalink] [raw]

Subject: Re: hit a KASan bug related to Perf during stress test

On 10/26, Peter Zijlstra wrote:
>
> On Tue, Oct 25, 2016 at 04:41:26PM +0200, Oleg Nesterov wrote:
> > >
> > > So what serialization would close that race? __task_pid_nr_ns() only
> > > seems to use RCU nothing more.
> >
> > I do not see how can we close this race, we obviously do not want to use
> > any locking.
> >
> > That is why I tried to suggest
> >
> > nr = __task_pid_nr_ns(p, type, event->ns);
> > if (!nr && !is_idle_task(p))
> > nr = -1;
> > return nr;
> >
> > but this will report -1 if p runs in another namespace, so perhaps we
> > can do
> >
> > nr = __task_pid_nr_ns(p, type, event->ns);
> > if (!nr && p->exit_state)
> > // it has already called exit_notify
> > nr = -1;
> > return nr;
>
> I think I'm asking how __task_pid_nr_ns() isn't susceptible to this race
> ;-)

which race ? ;) it seems that I confused you. Lets ignore the original
problem with perf_event_pid()->task_tgid_nr_ns() which can access the
freed memory. Lets suppose it is already fixed.

Another problem, as you noted, is that task_tgid_nr_ns/task_pid_nr_ns
returns zero if the task exits and this zero can be confused with the
swapper's pid.

return pid_alive(p) ? task_pid_nr_ns(p, event->ns) : ~0

still can return zero because pid_alive(p) == T is not stable if we can
race with the exiting task, so it can't guarantee that task_pid_nr_ns()
won't return 0.

So we can check ->exit_state or, even better, that same pid_alive() after
task_pid_nr_ns() returns 0.

nr = task_pid_nr_ns(p);
/* avoid -1 if it is idle thread or runs in another ns */
if (!nr && !pid_alive(p))
nr = -1;
return nr;

Or I misunderstood you?

Oleg.