2017-06-28 01:01:09

by Kyle Huey

[permalink] [raw]
Subject: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region

Sent again with LKML CCd, sorry for the noise.

- Kyle

On Tue, Jun 27, 2017 at 5:38 PM, Kyle Huey <[email protected]> wrote:
> cc1582c231ea introduced a regression in v4.12.0-rc5, and appears to be
> a candidate for backporting to stable branches.
>
> rr, a userspace record and replay debugger[0], uses the PMU interrupt
> to stop a program during replay to inject asynchronous events such as
> signals. We are counting retired conditional branches in userspace
> only. This changeset causes the kernel to drop interrupts on the
> floor if, during the PMU interrupt's "skid" region, the CPU enters
> kernel mode for whatever reason. When replaying traces of complex
> programs such as Firefox, we intermittently fail to deliver
> asynchronous events on time, leading the replay to diverge from the
> recorded state.
>
> It seems like this change should, at a bare minimum, be limited to
> counters that actually perform sampling of register state when the
> interrupt fires. In our case, with the retired conditional branches
> counter restricted to counting userspace events only, it makes no
> difference that the PMU interrupt happened to be delivered in the
> kernel.
>
> As this makes rr unusable on complex applications and cannot be
> efficiently worked around, we would appreciate this being addressed
> before 4.12 is finalized, and the regression not being introduced to
> stable branches.
>
> Thanks,
>
> - Kyle
>
> [0] http://rr-project.org/


2017-06-28 02:09:13

by Jin Yao

[permalink] [raw]
Subject: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region

Hi,

In theory, the PMI interrupts in skid region should be dropped, right?

For a userspace debugger, is it the only choice that relies on the
*skid* PMI interrupt?

Thanks
Jin Yao

On 6/28/2017 9:01 AM, Kyle Huey wrote:
> Sent again with LKML CCd, sorry for the noise.
>
> - Kyle
>
> On Tue, Jun 27, 2017 at 5:38 PM, Kyle Huey <[email protected]> wrote:
>> cc1582c231ea introduced a regression in v4.12.0-rc5, and appears to be
>> a candidate for backporting to stable branches.
>>
>> rr, a userspace record and replay debugger[0], uses the PMU interrupt
>> to stop a program during replay to inject asynchronous events such as
>> signals. We are counting retired conditional branches in userspace
>> only. This changeset causes the kernel to drop interrupts on the
>> floor if, during the PMU interrupt's "skid" region, the CPU enters
>> kernel mode for whatever reason. When replaying traces of complex
>> programs such as Firefox, we intermittently fail to deliver
>> asynchronous events on time, leading the replay to diverge from the
>> recorded state.
>>
>> It seems like this change should, at a bare minimum, be limited to
>> counters that actually perform sampling of register state when the
>> interrupt fires. In our case, with the retired conditional branches
>> counter restricted to counting userspace events only, it makes no
>> difference that the PMU interrupt happened to be delivered in the
>> kernel.
>>
>> As this makes rr unusable on complex applications and cannot be
>> efficiently worked around, we would appreciate this being addressed
>> before 4.12 is finalized, and the regression not being introduced to
>> stable branches.
>>
>> Thanks,
>>
>> - Kyle
>>
>> [0] http://rr-project.org/

2017-06-28 04:51:10

by Kyle Huey

[permalink] [raw]
Subject: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region

On Tue, Jun 27, 2017 at 7:09 PM, Jin, Yao <[email protected]> wrote:
> Hi,
>
> In theory, the PMI interrupts in skid region should be dropped, right?

No, why would they be dropped?

My understanding of the situation is as follows:

There is some time, call it t_0, where the hardware counter overflows.
The PMU triggers an interrupt, but this is not instantaneous. Call
the time when the interrupt is actually delivered t_1. Then t_1 - t_0
is the "skid".

Note that if the counter is `exclude_kernel`, then at t_0 the CPU
*must* be running a userspace program. But by t_1, the CPU may be
doing something else. Your patch changed things so that if at t_1 the
CPU is in the kernel, then the interrupt is discarded. But rr has
programmed the counter to deliver a signal on overflow (via F_SETSIG
on the fd returned by perf_event_open). This change results in the
signal never being delivered, because the interrupt was ignored.
(More accurately, the signal is delivered the *next* time the counter
overflows, which is far past where we wanted to inject our
asynchronous event into our tracee.

It seems to me that it might be reasonable to ignore the interrupt if
the purpose of the interrupt is to trigger sampling of the CPUs
register state. But if the interrupt will trigger some other
operation, such as a signal on an fd, then there's no reason to drop
it.

> For a userspace debugger, is it the only choice that relies on the *skid*
> PMI interrupt?

I don't understand this question, but hopefully the above clarified things.

- Kyle

> Thanks
> Jin Yao
>
>
> On 6/28/2017 9:01 AM, Kyle Huey wrote:
>>
>> Sent again with LKML CCd, sorry for the noise.
>>
>> - Kyle
>>
>> On Tue, Jun 27, 2017 at 5:38 PM, Kyle Huey <[email protected]> wrote:
>>>
>>> cc1582c231ea introduced a regression in v4.12.0-rc5, and appears to be
>>> a candidate for backporting to stable branches.
>>>
>>> rr, a userspace record and replay debugger[0], uses the PMU interrupt
>>> to stop a program during replay to inject asynchronous events such as
>>> signals. We are counting retired conditional branches in userspace
>>> only. This changeset causes the kernel to drop interrupts on the
>>> floor if, during the PMU interrupt's "skid" region, the CPU enters
>>> kernel mode for whatever reason. When replaying traces of complex
>>> programs such as Firefox, we intermittently fail to deliver
>>> asynchronous events on time, leading the replay to diverge from the
>>> recorded state.
>>>
>>> It seems like this change should, at a bare minimum, be limited to
>>> counters that actually perform sampling of register state when the
>>> interrupt fires. In our case, with the retired conditional branches
>>> counter restricted to counting userspace events only, it makes no
>>> difference that the PMU interrupt happened to be delivered in the
>>> kernel.
>>>
>>> As this makes rr unusable on complex applications and cannot be
>>> efficiently worked around, we would appreciate this being addressed
>>> before 4.12 is finalized, and the regression not being introduced to
>>> stable branches.
>>>
>>> Thanks,
>>>
>>> - Kyle
>>>
>>> [0] http://rr-project.org/
>
>

2017-06-28 05:35:54

by Jin Yao

[permalink] [raw]
Subject: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region

Hi Kyle,

I understand your requirement. Sorry I don't know the detail of rr
debugger, but I guess if it just uses counter overflow to deliver a
signal, could it set the counter without "exclude_kernel"?

Another consideration is, I'm not sure if the kernel can accurately
realize that if the interrupt is to trigger sampling or just deliver a
signal. Userspace may know that but kernel may not.

Anyway let's listen to more comments from community.

Thanks

Jin Yao


On 6/28/2017 12:51 PM, Kyle Huey wrote:
> On Tue, Jun 27, 2017 at 7:09 PM, Jin, Yao <[email protected]> wrote:
>> Hi,
>>
>> In theory, the PMI interrupts in skid region should be dropped, right?
> No, why would they be dropped?
>
> My understanding of the situation is as follows:
>
> There is some time, call it t_0, where the hardware counter overflows.
> The PMU triggers an interrupt, but this is not instantaneous. Call
> the time when the interrupt is actually delivered t_1. Then t_1 - t_0
> is the "skid".
>
> Note that if the counter is `exclude_kernel`, then at t_0 the CPU
> *must* be running a userspace program. But by t_1, the CPU may be
> doing something else. Your patch changed things so that if at t_1 the
> CPU is in the kernel, then the interrupt is discarded. But rr has
> programmed the counter to deliver a signal on overflow (via F_SETSIG
> on the fd returned by perf_event_open). This change results in the
> signal never being delivered, because the interrupt was ignored.
> (More accurately, the signal is delivered the *next* time the counter
> overflows, which is far past where we wanted to inject our
> asynchronous event into our tracee.
>
> It seems to me that it might be reasonable to ignore the interrupt if
> the purpose of the interrupt is to trigger sampling of the CPUs
> register state. But if the interrupt will trigger some other
> operation, such as a signal on an fd, then there's no reason to drop
> it.
>
>> For a userspace debugger, is it the only choice that relies on the *skid*
>> PMI interrupt?
> I don't understand this question, but hopefully the above clarified things.
>
> - Kyle
>
>> Thanks
>> Jin Yao
>>
>>
>> On 6/28/2017 9:01 AM, Kyle Huey wrote:
>>> Sent again with LKML CCd, sorry for the noise.
>>>
>>> - Kyle
>>>
>>> On Tue, Jun 27, 2017 at 5:38 PM, Kyle Huey <[email protected]> wrote:
>>>> cc1582c231ea introduced a regression in v4.12.0-rc5, and appears to be
>>>> a candidate for backporting to stable branches.
>>>>
>>>> rr, a userspace record and replay debugger[0], uses the PMU interrupt
>>>> to stop a program during replay to inject asynchronous events such as
>>>> signals. We are counting retired conditional branches in userspace
>>>> only. This changeset causes the kernel to drop interrupts on the
>>>> floor if, during the PMU interrupt's "skid" region, the CPU enters
>>>> kernel mode for whatever reason. When replaying traces of complex
>>>> programs such as Firefox, we intermittently fail to deliver
>>>> asynchronous events on time, leading the replay to diverge from the
>>>> recorded state.
>>>>
>>>> It seems like this change should, at a bare minimum, be limited to
>>>> counters that actually perform sampling of register state when the
>>>> interrupt fires. In our case, with the retired conditional branches
>>>> counter restricted to counting userspace events only, it makes no
>>>> difference that the PMU interrupt happened to be delivered in the
>>>> kernel.
>>>>
>>>> As this makes rr unusable on complex applications and cannot be
>>>> efficiently worked around, we would appreciate this being addressed
>>>> before 4.12 is finalized, and the regression not being introduced to
>>>> stable branches.
>>>>
>>>> Thanks,
>>>>
>>>> - Kyle
>>>>
>>>> [0] http://rr-project.org/
>>

2017-06-28 07:30:35

by Kyle Huey

[permalink] [raw]
Subject: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region

On Tue, Jun 27, 2017 at 10:35 PM, Jin, Yao <[email protected]> wrote:
> Hi Kyle,
>
> I understand your requirement. Sorry I don't know the detail of rr debugger,
> but I guess if it just uses counter overflow to deliver a signal, could it
> set the counter without "exclude_kernel"?

Unfortunately we cannot. We depend on the counter value being exactly
the same between recording and replay, and dropping `exclude_kernel`
would introduce non-determinism.

> Another consideration is, I'm not sure if the kernel can accurately realize
> that if the interrupt is to trigger sampling or just deliver a signal.
> Userspace may know that but kernel may not.

After looking at this code a bit more, I think that changing the
`is_sample_allowed` check from an early return to a guard around the
invocation of `overflow_handler` would fix this. I believe, but have
not tested, that `perf_event_fasync` is what must run to deliver our
signal, while the `overflow_handler` is what copies the kernel RIP/etc
into the output buffer that you want to skip.

- Kyle

> Anyway let's listen to more comments from community.
>
> Thanks
>
> Jin Yao
>
>
>
> On 6/28/2017 12:51 PM, Kyle Huey wrote:
>>
>> On Tue, Jun 27, 2017 at 7:09 PM, Jin, Yao <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> In theory, the PMI interrupts in skid region should be dropped, right?
>>
>> No, why would they be dropped?
>>
>> My understanding of the situation is as follows:
>>
>> There is some time, call it t_0, where the hardware counter overflows.
>> The PMU triggers an interrupt, but this is not instantaneous. Call
>> the time when the interrupt is actually delivered t_1. Then t_1 - t_0
>> is the "skid".
>>
>> Note that if the counter is `exclude_kernel`, then at t_0 the CPU
>> *must* be running a userspace program. But by t_1, the CPU may be
>> doing something else. Your patch changed things so that if at t_1 the
>> CPU is in the kernel, then the interrupt is discarded. But rr has
>> programmed the counter to deliver a signal on overflow (via F_SETSIG
>> on the fd returned by perf_event_open). This change results in the
>> signal never being delivered, because the interrupt was ignored.
>> (More accurately, the signal is delivered the *next* time the counter
>> overflows, which is far past where we wanted to inject our
>> asynchronous event into our tracee.
>>
>> It seems to me that it might be reasonable to ignore the interrupt if
>> the purpose of the interrupt is to trigger sampling of the CPUs
>> register state. But if the interrupt will trigger some other
>> operation, such as a signal on an fd, then there's no reason to drop
>> it.
>>
>>> For a userspace debugger, is it the only choice that relies on the *skid*
>>> PMI interrupt?
>>
>> I don't understand this question, but hopefully the above clarified
>> things.
>>
>> - Kyle
>>
>>> Thanks
>>> Jin Yao
>>>
>>>
>>> On 6/28/2017 9:01 AM, Kyle Huey wrote:
>>>>
>>>> Sent again with LKML CCd, sorry for the noise.
>>>>
>>>> - Kyle
>>>>
>>>> On Tue, Jun 27, 2017 at 5:38 PM, Kyle Huey <[email protected]> wrote:
>>>>>
>>>>> cc1582c231ea introduced a regression in v4.12.0-rc5, and appears to be
>>>>> a candidate for backporting to stable branches.
>>>>>
>>>>> rr, a userspace record and replay debugger[0], uses the PMU interrupt
>>>>> to stop a program during replay to inject asynchronous events such as
>>>>> signals. We are counting retired conditional branches in userspace
>>>>> only. This changeset causes the kernel to drop interrupts on the
>>>>> floor if, during the PMU interrupt's "skid" region, the CPU enters
>>>>> kernel mode for whatever reason. When replaying traces of complex
>>>>> programs such as Firefox, we intermittently fail to deliver
>>>>> asynchronous events on time, leading the replay to diverge from the
>>>>> recorded state.
>>>>>
>>>>> It seems like this change should, at a bare minimum, be limited to
>>>>> counters that actually perform sampling of register state when the
>>>>> interrupt fires. In our case, with the retired conditional branches
>>>>> counter restricted to counting userspace events only, it makes no
>>>>> difference that the PMU interrupt happened to be delivered in the
>>>>> kernel.
>>>>>
>>>>> As this makes rr unusable on complex applications and cannot be
>>>>> efficiently worked around, we would appreciate this being addressed
>>>>> before 4.12 is finalized, and the regression not being introduced to
>>>>> stable branches.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> - Kyle
>>>>>
>>>>> [0] http://rr-project.org/
>>>
>>>
>

2017-06-28 10:13:57

by Mark Rutland

[permalink] [raw]
Subject: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region

On Tue, Jun 27, 2017 at 09:51:00PM -0700, Kyle Huey wrote:
> On Tue, Jun 27, 2017 at 7:09 PM, Jin, Yao <[email protected]> wrote:
> > Hi,
> >
> > In theory, the PMI interrupts in skid region should be dropped, right?
>
> No, why would they be dropped?
>
> My understanding of the situation is as follows:
>
> There is some time, call it t_0, where the hardware counter overflows.
> The PMU triggers an interrupt, but this is not instantaneous. Call
> the time when the interrupt is actually delivered t_1. Then t_1 - t_0
> is the "skid".
>
> Note that if the counter is `exclude_kernel`, then at t_0 the CPU
> *must* be running a userspace program. But by t_1, the CPU may be
> doing something else. Your patch changed things so that if at t_1 the
> CPU is in the kernel, then the interrupt is discarded. But rr has
> programmed the counter to deliver a signal on overflow (via F_SETSIG
> on the fd returned by perf_event_open). This change results in the
> signal never being delivered, because the interrupt was ignored.
> (More accurately, the signal is delivered the *next* time the counter
> overflows, which is far past where we wanted to inject our
> asynchronous event into our tracee.

Yes, this is a bug.

As we're trying to avoid smapling state, I think we can move the check
into perf_prepare_sample() or __perf_event_output(), where that state is
actually sampled. I'll take a look at that momentarily.

Just to clarify, you don't care about the sample state at all? i.e. you
don't need the user program counter?

Is that signal delivered to the tracee, or to a different process that
traces it? If the latter, what ensures that the task is stopped
sufficiently quickly?

> It seems to me that it might be reasonable to ignore the interrupt if
> the purpose of the interrupt is to trigger sampling of the CPUs
> register state. But if the interrupt will trigger some other
> operation, such as a signal on an fd, then there's no reason to drop
> it.

Agreed. I'll try to have a patch for this soon.

I just need to figure out exactly where that overflow signal is
generated by the perf core.

Thanks,
Mark.

2017-06-28 10:57:00

by Mark Rutland

[permalink] [raw]
Subject: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote:
> On Tue, Jun 27, 2017 at 09:51:00PM -0700, Kyle Huey wrote:

> As we're trying to avoid smapling state, I think we can move the check
> into perf_prepare_sample() or __perf_event_output(), where that state is
> actually sampled. I'll take a look at that momentarily.
>
> Just to clarify, you don't care about the sample state at all? i.e. you
> don't need the user program counter?
>
> Is that signal delivered to the tracee, or to a different process that
> traces it? If the latter, what ensures that the task is stopped
> sufficiently quickly?
>
> > It seems to me that it might be reasonable to ignore the interrupt if
> > the purpose of the interrupt is to trigger sampling of the CPUs
> > register state. But if the interrupt will trigger some other
> > operation, such as a signal on an fd, then there's no reason to drop
> > it.
>
> Agreed. I'll try to have a patch for this soon.
>
> I just need to figure out exactly where that overflow signal is
> generated by the perf core.

I've figured that out now. That's handled by perf_pending_event(), whcih
is the irq_work we schedule in __perf_event_overflow().

Does the below patch work for you?

---->8----
>From bb1f99dace508ce34ab0818f91d59e73450fa142 Mon Sep 17 00:00:00 2001
From: Mark Rutland <[email protected]>
Date: Wed, 28 Jun 2017 11:39:25 +0100
Subject: [PATCH] perf/core: generate overflow signal when samples are dropped

We recently tried to kill an information leak where users could receive
kernel samples due to skid on the PMU interrupt. To block this, we
bailed out early in perf_event_overflow(), as we do for non-sampling
events.

This broke rr, which uses sampling events to receive a signal on
overflow (but does not care about the contents of the sample). These
signals are critical to the correct operation of rr.

Instead of bailing out early in perf_event_overflow, we can bail prior
to performing the actual sampling in __perf_event_output(). This avoids
the information leak, but preserves the generation of the signal.

Since we don't place any sample data into the ring buffer, the signal is
arguably spurious. However, a userspace ringbuffer consumer can already
consume data prior to taking the associated signals, and therefore must
handle spurious signals to operate correctly. Thus, this signal
shouldn't be harmful.

Fixes: cc1582c231ea041f ("perf/core: Drop kernel samples even though :u is specified")
Reported-by: Kyle Huey <[email protected]>
Signed-off-by: Mark Rutland <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jin Yao <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
---
kernel/events/core.c | 46 +++++++++++++++++++++++++---------------------
1 file changed, 25 insertions(+), 21 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6c4e523..6b263f3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6090,6 +6090,21 @@ void perf_prepare_sample(struct perf_event_header *header,
}
}

+static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs)
+{
+ /*
+ * Due to interrupt latency (AKA "skid"), we may enter the
+ * kernel before taking an overflow, even if the PMU is only
+ * counting user events.
+ * To avoid leaking information to userspace, we must always
+ * reject kernel samples when exclude_kernel is set.
+ */
+ if (event->attr.exclude_kernel && !user_mode(regs))
+ return false;
+
+ return true;
+}
+
static void __always_inline
__perf_event_output(struct perf_event *event,
struct perf_sample_data *data,
@@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header,
struct perf_output_handle handle;
struct perf_event_header header;

+ /*
+ * For security, drop the skid kernel samples if necessary.
+ */
+ if (!sample_is_allowed(event, regs))
+ return ret;
+
/* protect the callchain buffers */
rcu_read_lock();

@@ -7316,21 +7337,6 @@ int perf_event_account_interrupt(struct perf_event *event)
return __perf_event_account_interrupt(event, 1);
}

-static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs)
-{
- /*
- * Due to interrupt latency (AKA "skid"), we may enter the
- * kernel before taking an overflow, even if the PMU is only
- * counting user events.
- * To avoid leaking information to userspace, we must always
- * reject kernel samples when exclude_kernel is set.
- */
- if (event->attr.exclude_kernel && !user_mode(regs))
- return false;
-
- return true;
-}
-
/*
* Generic event overflow handling, sampling.
*/
@@ -7352,12 +7358,6 @@ static int __perf_event_overflow(struct perf_event *event,
ret = __perf_event_account_interrupt(event, throttle);

/*
- * For security, drop the skid kernel samples if necessary.
- */
- if (!sample_is_allowed(event, regs))
- return ret;
-
- /*
* XXX event_limit might not quite work as expected on inherited
* events
*/
@@ -7372,6 +7372,10 @@ static int __perf_event_overflow(struct perf_event *event,

READ_ONCE(event->overflow_handler)(event, data, regs);

+ /*
+ * We must generate a wakeup regardless of whether we actually
+ * generated a sample. This is relied upon by rr.
+ */
if (*perf_event_fasync(event) && event->pending_kill) {
event->pending_wakeup = 1;
irq_work_queue(&event->pending);
--
1.9.1

2017-06-28 12:40:45

by Vince Weaver

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

On Wed, 28 Jun 2017, Mark Rutland wrote:

> On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote:

> Instead of bailing out early in perf_event_overflow, we can bail prior
> to performing the actual sampling in __perf_event_output(). This avoids
> the information leak, but preserves the generation of the signal.
>
> Since we don't place any sample data into the ring buffer, the signal is
> arguably spurious. However, a userspace ringbuffer consumer can already
> consume data prior to taking the associated signals, and therefore must
> handle spurious signals to operate correctly. Thus, this signal
> shouldn't be harmful.

this could still break some of my perf_event validation tests.

Ones that set up a sampling event for every 1M instructions, run for 100M
instructions, and expect there to be 100 samples received.

If we're so worried about info leakage, can't we just zero-out the problem
address (or randomize the kernel address) rather than just pretending the
interrupt didn't happen?

Vince

2017-06-28 13:08:50

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

On Wed, Jun 28, 2017 at 08:40:30AM -0400, Vince Weaver wrote:
> On Wed, 28 Jun 2017, Mark Rutland wrote:
>
> > On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote:
>
> > Instead of bailing out early in perf_event_overflow, we can bail prior
> > to performing the actual sampling in __perf_event_output(). This avoids
> > the information leak, but preserves the generation of the signal.
> >
> > Since we don't place any sample data into the ring buffer, the signal is
> > arguably spurious. However, a userspace ringbuffer consumer can already
> > consume data prior to taking the associated signals, and therefore must
> > handle spurious signals to operate correctly. Thus, this signal
> > shouldn't be harmful.
>
> this could still break some of my perf_event validation tests.
>
> Ones that set up a sampling event for every 1M instructions, run for 100M
> instructions, and expect there to be 100 samples received.

Is that test reliable today?

I'd expect that at least on ARM it's not, given that events can be
counted imprecisely, and mode filters can be applied imprecisely. So you
might get fewer (or more) samples. I'd imagine similar is true on other
archtiectures.

If sampling took long enough, the existing ratelimiting could come into
effect, too.

Surely that already has some error margin?

> If we're so worried about info leakage, can't we just zero-out the problem
> address (or randomize the kernel address) rather than just pretending the
> interrupt didn't happen?

Making up zeroed or randomized data is going to confuse users. I can't
imagine that real users are going to want bogus samples that they have
to identify (somehow) in order to skip when processing the data.

I can see merit in signalling "lost" samples to userspace, so long as
they're easily distinguished from real samples.

One option is to fake up a sample using the user regs regardless, but
that's both fragile and surprising in other cases.

Thanks,
Mark.

2017-06-28 16:46:51

by Kyle Huey

[permalink] [raw]
Subject: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region

On Wed, Jun 28, 2017 at 3:12 AM, Mark Rutland <[email protected]> wrote:
> On Tue, Jun 27, 2017 at 09:51:00PM -0700, Kyle Huey wrote:
>> On Tue, Jun 27, 2017 at 7:09 PM, Jin, Yao <[email protected]> wrote:
>> > Hi,
>> >
>> > In theory, the PMI interrupts in skid region should be dropped, right?
>>
>> No, why would they be dropped?
>>
>> My understanding of the situation is as follows:
>>
>> There is some time, call it t_0, where the hardware counter overflows.
>> The PMU triggers an interrupt, but this is not instantaneous. Call
>> the time when the interrupt is actually delivered t_1. Then t_1 - t_0
>> is the "skid".
>>
>> Note that if the counter is `exclude_kernel`, then at t_0 the CPU
>> *must* be running a userspace program. But by t_1, the CPU may be
>> doing something else. Your patch changed things so that if at t_1 the
>> CPU is in the kernel, then the interrupt is discarded. But rr has
>> programmed the counter to deliver a signal on overflow (via F_SETSIG
>> on the fd returned by perf_event_open). This change results in the
>> signal never being delivered, because the interrupt was ignored.
>> (More accurately, the signal is delivered the *next* time the counter
>> overflows, which is far past where we wanted to inject our
>> asynchronous event into our tracee.
>
> Yes, this is a bug.
>
> As we're trying to avoid smapling state, I think we can move the check
> into perf_prepare_sample() or __perf_event_output(), where that state is
> actually sampled. I'll take a look at that momentarily.
>
> Just to clarify, you don't care about the sample state at all? i.e. you
> don't need the user program counter?

Right. `sample_regs_user`, `sample_star_user`, `branch_sample_type`,
etc are all 0.
https://github.com/mozilla/rr/blob/cf594dd01f07d96a61409e9f41a29f78c8c51693/src/PerfCounters.cc#L194
is what we do use.

> Is that signal delivered to the tracee, or to a different process that
> traces it? If the latter, what ensures that the task is stopped
> sufficiently quickly?

It's delivered to the tracee (via an F_SETOWN_EX with the tracee tid).
In practice we've found that on modern Intel hardware that the
interrupt and resulting signal delivery delay is bounded by a
relatively small number of counter events.

>> It seems to me that it might be reasonable to ignore the interrupt if
>> the purpose of the interrupt is to trigger sampling of the CPUs
>> register state. But if the interrupt will trigger some other
>> operation, such as a signal on an fd, then there's no reason to drop
>> it.
>
> Agreed. I'll try to have a patch for this soon.
>
> I just need to figure out exactly where that overflow signal is
> generated by the perf core.
>
> Thanks,
> Mark.

- Kyle

2017-06-28 16:48:37

by Kyle Huey

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <[email protected]> wrote:
> On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote:
>> On Tue, Jun 27, 2017 at 09:51:00PM -0700, Kyle Huey wrote:
>
>> As we're trying to avoid smapling state, I think we can move the check
>> into perf_prepare_sample() or __perf_event_output(), where that state is
>> actually sampled. I'll take a look at that momentarily.
>>
>> Just to clarify, you don't care about the sample state at all? i.e. you
>> don't need the user program counter?
>>
>> Is that signal delivered to the tracee, or to a different process that
>> traces it? If the latter, what ensures that the task is stopped
>> sufficiently quickly?
>>
>> > It seems to me that it might be reasonable to ignore the interrupt if
>> > the purpose of the interrupt is to trigger sampling of the CPUs
>> > register state. But if the interrupt will trigger some other
>> > operation, such as a signal on an fd, then there's no reason to drop
>> > it.
>>
>> Agreed. I'll try to have a patch for this soon.
>>
>> I just need to figure out exactly where that overflow signal is
>> generated by the perf core.
>
> I've figured that out now. That's handled by perf_pending_event(), whcih
> is the irq_work we schedule in __perf_event_overflow().
>
> Does the below patch work for you?
>
> ---->8----
> From bb1f99dace508ce34ab0818f91d59e73450fa142 Mon Sep 17 00:00:00 2001
> From: Mark Rutland <[email protected]>
> Date: Wed, 28 Jun 2017 11:39:25 +0100
> Subject: [PATCH] perf/core: generate overflow signal when samples are dropped
>
> We recently tried to kill an information leak where users could receive
> kernel samples due to skid on the PMU interrupt. To block this, we
> bailed out early in perf_event_overflow(), as we do for non-sampling
> events.
>
> This broke rr, which uses sampling events to receive a signal on
> overflow (but does not care about the contents of the sample). These
> signals are critical to the correct operation of rr.
>
> Instead of bailing out early in perf_event_overflow, we can bail prior
> to performing the actual sampling in __perf_event_output(). This avoids
> the information leak, but preserves the generation of the signal.
>
> Since we don't place any sample data into the ring buffer, the signal is
> arguably spurious. However, a userspace ringbuffer consumer can already
> consume data prior to taking the associated signals, and therefore must
> handle spurious signals to operate correctly. Thus, this signal
> shouldn't be harmful.
>
> Fixes: cc1582c231ea041f ("perf/core: Drop kernel samples even though :u is specified")
> Reported-by: Kyle Huey <[email protected]>
> Signed-off-by: Mark Rutland <[email protected]>
> Cc: Alexander Shishkin <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Jin Yao <[email protected]>
> Cc: Peter Zijlstra (Intel) <[email protected]>
> ---
> kernel/events/core.c | 46 +++++++++++++++++++++++++---------------------
> 1 file changed, 25 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 6c4e523..6b263f3 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6090,6 +6090,21 @@ void perf_prepare_sample(struct perf_event_header *header,
> }
> }
>
> +static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs)
> +{
> + /*
> + * Due to interrupt latency (AKA "skid"), we may enter the
> + * kernel before taking an overflow, even if the PMU is only
> + * counting user events.
> + * To avoid leaking information to userspace, we must always
> + * reject kernel samples when exclude_kernel is set.
> + */
> + if (event->attr.exclude_kernel && !user_mode(regs))
> + return false;
> +
> + return true;
> +}
> +
> static void __always_inline
> __perf_event_output(struct perf_event *event,
> struct perf_sample_data *data,
> @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header,
> struct perf_output_handle handle;
> struct perf_event_header header;
>
> + /*
> + * For security, drop the skid kernel samples if necessary.
> + */
> + if (!sample_is_allowed(event, regs))
> + return ret;

Just a bare return here.

> +
> /* protect the callchain buffers */
> rcu_read_lock();
>
> @@ -7316,21 +7337,6 @@ int perf_event_account_interrupt(struct perf_event *event)
> return __perf_event_account_interrupt(event, 1);
> }
>
> -static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs)
> -{
> - /*
> - * Due to interrupt latency (AKA "skid"), we may enter the
> - * kernel before taking an overflow, even if the PMU is only
> - * counting user events.
> - * To avoid leaking information to userspace, we must always
> - * reject kernel samples when exclude_kernel is set.
> - */
> - if (event->attr.exclude_kernel && !user_mode(regs))
> - return false;
> -
> - return true;
> -}
> -
> /*
> * Generic event overflow handling, sampling.
> */
> @@ -7352,12 +7358,6 @@ static int __perf_event_overflow(struct perf_event *event,
> ret = __perf_event_account_interrupt(event, throttle);
>
> /*
> - * For security, drop the skid kernel samples if necessary.
> - */
> - if (!sample_is_allowed(event, regs))
> - return ret;
> -
> - /*
> * XXX event_limit might not quite work as expected on inherited
> * events
> */
> @@ -7372,6 +7372,10 @@ static int __perf_event_overflow(struct perf_event *event,
>
> READ_ONCE(event->overflow_handler)(event, data, regs);
>
> + /*
> + * We must generate a wakeup regardless of whether we actually
> + * generated a sample. This is relied upon by rr.
> + */
> if (*perf_event_fasync(event) && event->pending_kill) {
> event->pending_wakeup = 1;
> irq_work_queue(&event->pending);
> --
> 1.9.1
>

I can confirm that with that fixed to compile, this patch fixes rr.

Thanks!

- Kyle

2017-06-28 17:20:47

by Mark Rutland

[permalink] [raw]
Subject: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region

On Wed, Jun 28, 2017 at 09:46:43AM -0700, Kyle Huey wrote:
> On Wed, Jun 28, 2017 at 3:12 AM, Mark Rutland <[email protected]> wrote:
> > On Tue, Jun 27, 2017 at 09:51:00PM -0700, Kyle Huey wrote:
> >> My understanding of the situation is as follows:
> >>
> >> There is some time, call it t_0, where the hardware counter overflows.
> >> The PMU triggers an interrupt, but this is not instantaneous. Call
> >> the time when the interrupt is actually delivered t_1. Then t_1 - t_0
> >> is the "skid".
> >>
> >> Note that if the counter is `exclude_kernel`, then at t_0 the CPU
> >> *must* be running a userspace program. But by t_1, the CPU may be
> >> doing something else. Your patch changed things so that if at t_1 the
> >> CPU is in the kernel, then the interrupt is discarded. But rr has
> >> programmed the counter to deliver a signal on overflow (via F_SETSIG
> >> on the fd returned by perf_event_open). This change results in the
> >> signal never being delivered, because the interrupt was ignored.
> >> (More accurately, the signal is delivered the *next* time the counter
> >> overflows, which is far past where we wanted to inject our
> >> asynchronous event into our tracee.
> >
> > Yes, this is a bug.
> >
> > As we're trying to avoid smapling state, I think we can move the check
> > into perf_prepare_sample() or __perf_event_output(), where that state is
> > actually sampled. I'll take a look at that momentarily.
> >
> > Just to clarify, you don't care about the sample state at all? i.e. you
> > don't need the user program counter?
>
> Right. `sample_regs_user`, `sample_star_user`, `branch_sample_type`,
> etc are all 0.
> https://github.com/mozilla/rr/blob/cf594dd01f07d96a61409e9f41a29f78c8c51693/src/PerfCounters.cc#L194
> is what we do use.

Given that, I must be missing something.

In __perf_event_overflow(), we already bail out early if
!is_sampling_event(event), i.e. when the sample_period is 0.

Your attr has a sample_period of zero, so something must be initialising
that.

Do you always call PERF_EVENT_IOC_PERIOD, or is something in the core
fiddling with the sample period behind your back?

It seems odd that an event without any samples to take has a sample
period. I'm surprised that there's not *some* sample_type set.

> > Is that signal delivered to the tracee, or to a different process that
> > traces it? If the latter, what ensures that the task is stopped
> > sufficiently quickly?
>
> It's delivered to the tracee (via an F_SETOWN_EX with the tracee tid).
> In practice we've found that on modern Intel hardware that the
> interrupt and resulting signal delivery delay is bounded by a
> relatively small number of counter events.

Ok.

Thanks,
Mark.

2017-06-28 17:36:39

by Kyle Huey

[permalink] [raw]
Subject: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region

On Wed, Jun 28, 2017 at 10:19 AM, Mark Rutland <[email protected]> wrote:
> On Wed, Jun 28, 2017 at 09:46:43AM -0700, Kyle Huey wrote:
>> On Wed, Jun 28, 2017 at 3:12 AM, Mark Rutland <[email protected]> wrote:
>> > On Tue, Jun 27, 2017 at 09:51:00PM -0700, Kyle Huey wrote:
>> >> My understanding of the situation is as follows:
>> >>
>> >> There is some time, call it t_0, where the hardware counter overflows.
>> >> The PMU triggers an interrupt, but this is not instantaneous. Call
>> >> the time when the interrupt is actually delivered t_1. Then t_1 - t_0
>> >> is the "skid".
>> >>
>> >> Note that if the counter is `exclude_kernel`, then at t_0 the CPU
>> >> *must* be running a userspace program. But by t_1, the CPU may be
>> >> doing something else. Your patch changed things so that if at t_1 the
>> >> CPU is in the kernel, then the interrupt is discarded. But rr has
>> >> programmed the counter to deliver a signal on overflow (via F_SETSIG
>> >> on the fd returned by perf_event_open). This change results in the
>> >> signal never being delivered, because the interrupt was ignored.
>> >> (More accurately, the signal is delivered the *next* time the counter
>> >> overflows, which is far past where we wanted to inject our
>> >> asynchronous event into our tracee.
>> >
>> > Yes, this is a bug.
>> >
>> > As we're trying to avoid smapling state, I think we can move the check
>> > into perf_prepare_sample() or __perf_event_output(), where that state is
>> > actually sampled. I'll take a look at that momentarily.
>> >
>> > Just to clarify, you don't care about the sample state at all? i.e. you
>> > don't need the user program counter?
>>
>> Right. `sample_regs_user`, `sample_star_user`, `branch_sample_type`,
>> etc are all 0.
>> https://github.com/mozilla/rr/blob/cf594dd01f07d96a61409e9f41a29f78c8c51693/src/PerfCounters.cc#L194
>> is what we do use.
>
> Given that, I must be missing something.
>
> In __perf_event_overflow(), we already bail out early if
> !is_sampling_event(event), i.e. when the sample_period is 0.
>
> Your attr has a sample_period of zero, so something must be initialising
> that.
>
> Do you always call PERF_EVENT_IOC_PERIOD, or is something in the core
> fiddling with the sample period behind your back?

We always either set sample_period or call PERF_EVENT_IOC_PERIOD (with
an enormous number if we don't actually want an interrupt. See
`PerfCounters::reset`, line 446.

> It seems odd that an event without any samples to take has a sample
> period. I'm surprised that there's not *some* sample_type set.

Perhaps sample_period is misleadingly named :) Alternatively, you
could imagine it as sampling where we're only interested in whether
the counter passed the sampling value or not.

>> > Is that signal delivered to the tracee, or to a different process that
>> > traces it? If the latter, what ensures that the task is stopped
>> > sufficiently quickly?
>>
>> It's delivered to the tracee (via an F_SETOWN_EX with the tracee tid).
>> In practice we've found that on modern Intel hardware that the
>> interrupt and resulting signal delivery delay is bounded by a
>> relatively small number of counter events.
>
> Ok.
>
> Thanks,
> Mark.

- Kyle

2017-06-28 17:48:44

by Robert O'Callahan

[permalink] [raw]
Subject: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region

On Wed, Jun 28, 2017 at 10:19 AM, Mark Rutland <[email protected]> wrote:
> It seems odd that an event without any samples to take has a sample
> period. I'm surprised that there's not *some* sample_type set.

Yes, it is a bit odd :-). Rather than trying to gather some limited
set of information about the target process, we want to interrupt its
execution after a particular number of events have occurred.

The bigger picture here is that we're replaying a previously recorded
process execution. We want to the process to run until it reaches a
particular state that occurred during recording; that state is
identified by a value for the performance counter (retired conditional
branches) plus the values of the general-purpose registers. So our
first step is to run the process until the performance counter reaches
a value less than but "close to" the target value. (We know the
process will be interrupted after a number of extra events have
occurred, so we set the sample period to a value less than the target
by a heuristic "maximum skid size".)

Rob
--
lbir ye,ea yer.tnietoehr rdn rdsme,anea lurpr edna e hnysnenh hhe uresyf toD
selthor stor edna siewaoeodm or v sstvr esBa kbvted,t rdsme,aoreseoouoto
o l euetiuruewFa kbn e hnystoivateweh uresyf tulsa rehr rdm or rnea lurpr
.a war hsrer holsa rodvted,t nenh hneireseoouot.tniesiewaoeivatewt sstvr esn

2017-06-28 17:50:01

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

On Wed, Jun 28, 2017 at 09:48:27AM -0700, Kyle Huey wrote:
> On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <[email protected]> wrote:
> > @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header,
> > struct perf_output_handle handle;
> > struct perf_event_header header;
> >
> > + /*
> > + * For security, drop the skid kernel samples if necessary.
> > + */
> > + if (!sample_is_allowed(event, regs))
> > + return ret;
>
> Just a bare return here.

Ugh, yes. Sorry about that. I'll fix that up.

[...]

> I can confirm that with that fixed to compile, this patch fixes rr.

Thanks for giving this a go.

Having thought about this some more, I think Vince does make a good
point that throwing away samples is liable to break stuff, e.g. that
which only relies on (non-sensitive) samples.

It still seems wrong to make up data, though.

Maybe for exclude_kernel && !exclude_user events we can always generate
samples from the user regs, rather than the exception regs. That's going
to be closer to what the user wants, regardless. I'll take a look
tomorrow.

Thanks,
Mark.

2017-06-28 17:53:44

by Mark Rutland

[permalink] [raw]
Subject: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region

On Wed, Jun 28, 2017 at 10:36:20AM -0700, Kyle Huey wrote:
> On Wed, Jun 28, 2017 at 10:19 AM, Mark Rutland <[email protected]> wrote:
> > On Wed, Jun 28, 2017 at 09:46:43AM -0700, Kyle Huey wrote:
> >> On Wed, Jun 28, 2017 at 3:12 AM, Mark Rutland <[email protected]> wrote:

> >> > Just to clarify, you don't care about the sample state at all? i.e. you
> >> > don't need the user program counter?
> >>
> >> Right. `sample_regs_user`, `sample_star_user`, `branch_sample_type`,
> >> etc are all 0.
> >> https://github.com/mozilla/rr/blob/cf594dd01f07d96a61409e9f41a29f78c8c51693/src/PerfCounters.cc#L194
> >> is what we do use.
> >
> > Given that, I must be missing something.
> >
> > In __perf_event_overflow(), we already bail out early if
> > !is_sampling_event(event), i.e. when the sample_period is 0.
> >
> > Your attr has a sample_period of zero, so something must be initialising
> > that.
> >
> > Do you always call PERF_EVENT_IOC_PERIOD, or is something in the core
> > fiddling with the sample period behind your back?
>
> We always either set sample_period or call PERF_EVENT_IOC_PERIOD (with
> an enormous number if we don't actually want an interrupt. See
> `PerfCounters::reset`, line 446.

Ah, thanks for the pointer.

> > It seems odd that an event without any samples to take has a sample
> > period. I'm surprised that there's not *some* sample_type set.
>
> Perhaps sample_period is misleadingly named :) Alternatively, you
> could imagine it as sampling where we're only interested in whether
> the counter passed the sampling value or not.

Sure; it's just that I suspect the existing kernel behviour isn't
*quite* intentional, and I could easily see it getting broken in future,
e.g. if someone were to make is_sampling_event() check the attr for
sample types.

So we need to keep an eye on that, regardless.

Thanks,
Mark.

2017-06-28 22:55:13

by Kyle Huey

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

On Wed, Jun 28, 2017 at 10:49 AM, Mark Rutland <[email protected]> wrote:
> On Wed, Jun 28, 2017 at 09:48:27AM -0700, Kyle Huey wrote:
>> On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <[email protected]> wrote:
>> > @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header,
>> > struct perf_output_handle handle;
>> > struct perf_event_header header;
>> >
>> > + /*
>> > + * For security, drop the skid kernel samples if necessary.
>> > + */
>> > + if (!sample_is_allowed(event, regs))
>> > + return ret;
>>
>> Just a bare return here.
>
> Ugh, yes. Sorry about that. I'll fix that up.
>
> [...]
>
>> I can confirm that with that fixed to compile, this patch fixes rr.
>
> Thanks for giving this a go.
>
> Having thought about this some more, I think Vince does make a good
> point that throwing away samples is liable to break stuff, e.g. that
> which only relies on (non-sensitive) samples.
>
> It still seems wrong to make up data, though.
>
> Maybe for exclude_kernel && !exclude_user events we can always generate
> samples from the user regs, rather than the exception regs. That's going
> to be closer to what the user wants, regardless. I'll take a look
> tomorrow.

I'm not very familiar with the kernel internals, but the reason I
didn't suggest this originally is it seems like it will be difficult
to determine what the "correct" userspace registers are. For example,
what happens if a performance counter is fixed to a given tid, the
interrupt fires during a context switch from that task to another that
is not being monitored, and the kernel is far enough along in the
context switch that the current task struct has been switched out?
Reporting the new task's registers seems as bad as reporting the
kernel's registers. But maybe this is easier than I imagine for
whatever reason.

Something to think about.

- Kyle

2017-06-29 00:27:25

by Jin Yao

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)



On 6/29/2017 6:55 AM, Kyle Huey wrote:
> On Wed, Jun 28, 2017 at 10:49 AM, Mark Rutland <[email protected]> wrote:
>> On Wed, Jun 28, 2017 at 09:48:27AM -0700, Kyle Huey wrote:
>>> On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <[email protected]> wrote:
>>>> @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header,
>>>> struct perf_output_handle handle;
>>>> struct perf_event_header header;
>>>>
>>>> + /*
>>>> + * For security, drop the skid kernel samples if necessary.
>>>> + */
>>>> + if (!sample_is_allowed(event, regs))
>>>> + return ret;
>>> Just a bare return here.
>> Ugh, yes. Sorry about that. I'll fix that up.
>>
>> [...]
>>
>>> I can confirm that with that fixed to compile, this patch fixes rr.
>> Thanks for giving this a go.
>>
>> Having thought about this some more, I think Vince does make a good
>> point that throwing away samples is liable to break stuff, e.g. that
>> which only relies on (non-sensitive) samples.
>>
>> It still seems wrong to make up data, though.
>>
>> Maybe for exclude_kernel && !exclude_user events we can always generate
>> samples from the user regs, rather than the exception regs. That's going
>> to be closer to what the user wants, regardless. I'll take a look
>> tomorrow.
> I'm not very familiar with the kernel internals, but the reason I
> didn't suggest this originally is it seems like it will be difficult
> to determine what the "correct" userspace registers are. For example,
> what happens if a performance counter is fixed to a given tid, the
> interrupt fires during a context switch from that task to another that
> is not being monitored, and the kernel is far enough along in the
> context switch that the current task struct has been switched out?
> Reporting the new task's registers seems as bad as reporting the
> kernel's registers. But maybe this is easier than I imagine for
> whatever reason.
>
> Something to think about.
>
> - Kyle

Yes, I think so.

The skid interrupt may be triggered at a wrong context and return wrong
indications (e.g. wrong regs) to userspace.

So that's why I think the *skid* interrupt had better be dropped.

Thanks
Jin Yao




2017-06-29 08:12:51

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)


* Mark Rutland <[email protected]> wrote:

> It still seems wrong to make up data, though.

So what we have here is a hardware quirk: we asked for user-space samples, but
didn't get them and we cannot expose the kernel-internal address.

The question is, how do we handle the hardware quirk. Since we cannot fix the
hardware on existing systems there's really just two choices:

- Lose the sample (and signal it as a lost sample)

- Keep the sample but change the sensitive kernel-internal address to something
that is not sensitive: 0 or -1 works, but we could perhaps also return a
well-known user-space address such as the vDSO syscall trampoline or such?

there's no other option really.

I'd lean towards Vince's take: losing samples is more surprising than getting the
occasional sample with some sanitized data in it.

If we make the artificial data still a meaningful user-space address, related to
kernel entries, then it might even be a bonus, as users would learn to recognize
it as: 'oh, skid artifact, I know about that'.

Thanks,

Ingo

2017-06-29 08:14:06

by Alexey Budankov

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

Hi Folks,

On 28.06.2017 16:07, Mark Rutland wrote:
> On Wed, Jun 28, 2017 at 08:40:30AM -0400, Vince Weaver wrote:
>> On Wed, 28 Jun 2017, Mark Rutland wrote:
>>
>>> On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote:
>>
>>> Instead of bailing out early in perf_event_overflow, we can bail prior
>>> to performing the actual sampling in __perf_event_output(). This avoids
>>> the information leak, but preserves the generation of the signal.
>>>
>>> Since we don't place any sample data into the ring buffer, the signal is
>>> arguably spurious. However, a userspace ringbuffer consumer can already
>>> consume data prior to taking the associated signals, and therefore must
>>> handle spurious signals to operate correctly. Thus, this signal
>>> shouldn't be harmful.
>>
>> this could still break some of my perf_event validation tests.
>>
>> Ones that set up a sampling event for every 1M instructions, run for 100M
>> instructions, and expect there to be 100 samples received.
>
> Is that test reliable today?
>
> I'd expect that at least on ARM it's not, given that events can be
> counted imprecisely, and mode filters can be applied imprecisely. So you
> might get fewer (or more) samples. I'd imagine similar is true on other
> archtiectures.
>
> If sampling took long enough, the existing ratelimiting could come into
> effect, too.
>
> Surely that already has some error margin?

FYI.

>From my recent experience and observation (on Intel Xeon Phi)
wakeup_events_overflow and overflow_poll tests may fail if
/proc/sys/kernel/perf_event_max_sample_rate is low enough:

# cat /proc/sys/kernel/perf_event_max_sample_rate
6000
# abudanko/perf_event_tests/tests/overflow/wakeup_events_overflow
This tests wakeup event overflows.
Testing with wakeup_events=1.
Counts, using mmap buffer 0x7f707b0dc000
POLL_IN : 10
POLL_OUT: 0
POLL_MSG: 0
POLL_ERR: 0
POLL_PRI: 0
POLL_HUP: 0
UNKNOWN : 0
Testing wakeup events overflow... PASSED
# abudanko/perf_event_tests/tests/overflow/overflow_poll
This tests using poll() to catch overflow.
Monitoring pid 131412 status 1407
Child has stopped due to signal 5 (Trace/breakpoint trap)
Continuing child
Returned HUP!
Counts, using mmap buffer 0x7fb3bfe04000
POLL_IN : 10
POLL_OUT: 0
POLL_MSG: 0
POLL_ERR: 0
POLL_PRI: 0
POLL_HUP: 1
UNKNOWN : 0
Testing catching overflow with poll()... PASSED

# echo 1000 > /proc/sys/kernel/perf_event_max_sample_rate
# abudanko/perf_event_tests/tests/overflow/overflow_poll
This tests using poll() to catch overflow.
Monitoring pid 131551 status 1407
Child has stopped due to signal 5 (Trace/breakpoint trap)
Continuing child
Returned HUP!
Counts, using mmap buffer 0x7f80532df000
POLL_IN : 9
POLL_OUT: 0
POLL_MSG: 0
POLL_ERR: 0
POLL_PRI: 0
POLL_HUP: 1
UNKNOWN : 0
Unexpected POLL_IN interrupt.
Testing catching overflow with poll()... FAILED
[root@nntpdsd52-210 ~]# abudanko/perf_event_tests/tests/overflow/overflow_poll
This tests using poll() to catch overflow.
Monitoring pid 131553 status 1407
Child has stopped due to signal 5 (Trace/breakpoint trap)
Continuing child
Returned HUP!
Counts, using mmap buffer 0x7f650952c000
POLL_IN : 9
POLL_OUT: 0
POLL_MSG: 0
POLL_ERR: 0
POLL_PRI: 0
POLL_HUP: 1
UNKNOWN : 0
Unexpected POLL_IN interrupt.
Testing catching overflow with poll()... FAILED

>
>> If we're so worried about info leakage, can't we just zero-out the problem
>> address (or randomize the kernel address) rather than just pretending the
>> interrupt didn't happen?
>
> Making up zeroed or randomized data is going to confuse users. I can't
> imagine that real users are going to want bogus samples that they have
> to identify (somehow) in order to skip when processing the data.
>
> I can see merit in signalling "lost" samples to userspace, so long as
> they're easily distinguished from real samples.
>
> One option is to fake up a sample using the user regs regardless, but
> that's both fragile and surprising in other cases.
>
> Thanks,
> Mark.
>

Thanks,
Alexey


2017-06-29 08:26:19

by Alexey Budankov

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

On 29.06.2017 11:13, Alexey Budankov wrote:
> Hi Folks,
>
> On 28.06.2017 16:07, Mark Rutland wrote:
>> On Wed, Jun 28, 2017 at 08:40:30AM -0400, Vince Weaver wrote:
>>> On Wed, 28 Jun 2017, Mark Rutland wrote:
>>>
>>>> On Wed, Jun 28, 2017 at 11:12:48AM +0100, Mark Rutland wrote:
>>>
>>>> Instead of bailing out early in perf_event_overflow, we can bail prior
>>>> to performing the actual sampling in __perf_event_output(). This avoids
>>>> the information leak, but preserves the generation of the signal.
>>>>
>>>> Since we don't place any sample data into the ring buffer, the signal is
>>>> arguably spurious. However, a userspace ringbuffer consumer can already
>>>> consume data prior to taking the associated signals, and therefore must
>>>> handle spurious signals to operate correctly. Thus, this signal
>>>> shouldn't be harmful.
>>>
>>> this could still break some of my perf_event validation tests.
>>>
>>> Ones that set up a sampling event for every 1M instructions, run for 100M
>>> instructions, and expect there to be 100 samples received.
>>
>> Is that test reliable today?
>>
>> I'd expect that at least on ARM it's not, given that events can be
>> counted imprecisely, and mode filters can be applied imprecisely. So you
>> might get fewer (or more) samples. I'd imagine similar is true on other
>> archtiectures.
>>
>> If sampling took long enough, the existing ratelimiting could come into
>> effect, too.
>>
>> Surely that already has some error margin?
>
> FYI.
>
>>From my recent experience and observation (on Intel Xeon Phi)
> wakeup_events_overflow and overflow_poll tests may fail if
> /proc/sys/kernel/perf_event_max_sample_rate is low enough:
>
> # cat /proc/sys/kernel/perf_event_max_sample_rate
> 6000
> # abudanko/perf_event_tests/tests/overflow/wakeup_events_overflow
> This tests wakeup event overflows.
> Testing with wakeup_events=1.
> Counts, using mmap buffer 0x7f707b0dc000
> POLL_IN : 10
> POLL_OUT: 0
> POLL_MSG: 0
> POLL_ERR: 0
> POLL_PRI: 0
> POLL_HUP: 0
> UNKNOWN : 0
> Testing wakeup events overflow... PASSED
> # abudanko/perf_event_tests/tests/overflow/overflow_poll
> This tests using poll() to catch overflow.
> Monitoring pid 131412 status 1407
> Child has stopped due to signal 5 (Trace/breakpoint trap)
> Continuing child
> Returned HUP!
> Counts, using mmap buffer 0x7fb3bfe04000
> POLL_IN : 10
> POLL_OUT: 0
> POLL_MSG: 0
> POLL_ERR: 0
> POLL_PRI: 0
> POLL_HUP: 1
> UNKNOWN : 0
> Testing catching overflow with poll()... PASSED
>
> # echo 1000 > /proc/sys/kernel/perf_event_max_sample_rate
> # abudanko/perf_event_tests/tests/overflow/overflow_poll
> This tests using poll() to catch overflow.
> Monitoring pid 131551 status 1407
> Child has stopped due to signal 5 (Trace/breakpoint trap)
> Continuing child
> Returned HUP!
> Counts, using mmap buffer 0x7f80532df000
> POLL_IN : 9
> POLL_OUT: 0
> POLL_MSG: 0
> POLL_ERR: 0
> POLL_PRI: 0
> POLL_HUP: 1
> UNKNOWN : 0
> Unexpected POLL_IN interrupt.
> Testing catching overflow with poll()... FAILED
> [root@nntpdsd52-210 ~]# abudanko/perf_event_tests/tests/overflow/overflow_poll
> This tests using poll() to catch overflow.
> Monitoring pid 131553 status 1407
> Child has stopped due to signal 5 (Trace/breakpoint trap)
> Continuing child
> Returned HUP!
> Counts, using mmap buffer 0x7f650952c000
> POLL_IN : 9
> POLL_OUT: 0
> POLL_MSG: 0
> POLL_ERR: 0
> POLL_PRI: 0
> POLL_HUP: 1
> UNKNOWN : 0
> Unexpected POLL_IN interrupt.
> Testing catching overflow with poll()... FAILED

More of the other test:

# echo 1000 > /proc/sys/kernel/perf_event_max_sample_rate
[ ~]# abudanko/perf_event_tests/tests/overflow/wakeup_events_overflow
This tests wakeup event overflows.
Testing with wakeup_events=1.
Counts, using mmap buffer 0x7fe65deff000
POLL_IN : 4
POLL_OUT: 0
POLL_MSG: 0
POLL_ERR: 0
POLL_PRI: 0
POLL_HUP: 0
UNKNOWN : 0
POLL_IN value 4, expected 10.
Testing wakeup events overflow... FAILED
[ ~]# abudanko/perf_event_tests/tests/overflow/wakeup_events_overflow
This tests wakeup event overflows.
Testing with wakeup_events=1.
Counts, using mmap buffer 0x7f818a732000
POLL_IN : 2
POLL_OUT: 0
POLL_MSG: 0
POLL_ERR: 0
POLL_PRI: 0
POLL_HUP: 0
UNKNOWN : 0
POLL_IN value 2, expected 10.
Testing wakeup events overflow... FAILED
[ ~]# abudanko/perf_event_tests/tests/overflow/wakeup_events_overflow
This tests wakeup event overflows.
Testing with wakeup_events=1.
Counts, using mmap buffer 0x7fb3675c5000
POLL_IN : 4
POLL_OUT: 0
POLL_MSG: 0
POLL_ERR: 0
POLL_PRI: 0
POLL_HUP: 0
UNKNOWN : 0
POLL_IN value 4, expected 10.
Testing wakeup events overflow... FAILED

>
>>
>>> If we're so worried about info leakage, can't we just zero-out the problem
>>> address (or randomize the kernel address) rather than just pretending the
>>> interrupt didn't happen?
>>
>> Making up zeroed or randomized data is going to confuse users. I can't
>> imagine that real users are going to want bogus samples that they have
>> to identify (somehow) in order to skip when processing the data.
>>
>> I can see merit in signalling "lost" samples to userspace, so long as
>> they're easily distinguished from real samples.
>>
>> One option is to fake up a sample using the user regs regardless, but
>> that's both fragile and surprising in other cases.
>>
>> Thanks,
>> Mark.
>>
>
> Thanks,
> Alexey
>
>
>

2017-06-30 17:44:21

by Kyle Huey

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

On Wed, Jun 28, 2017 at 3:55 PM, Kyle Huey <[email protected]> wrote:
> On Wed, Jun 28, 2017 at 10:49 AM, Mark Rutland <[email protected]> wrote:
>> On Wed, Jun 28, 2017 at 09:48:27AM -0700, Kyle Huey wrote:
>>> On Wed, Jun 28, 2017 at 3:56 AM, Mark Rutland <[email protected]> wrote:
>>> > @@ -6101,6 +6116,12 @@ void perf_prepare_sample(struct perf_event_header *header,
>>> > struct perf_output_handle handle;
>>> > struct perf_event_header header;
>>> >
>>> > + /*
>>> > + * For security, drop the skid kernel samples if necessary.
>>> > + */
>>> > + if (!sample_is_allowed(event, regs))
>>> > + return ret;
>>>
>>> Just a bare return here.
>>
>> Ugh, yes. Sorry about that. I'll fix that up.
>>
>> [...]
>>
>>> I can confirm that with that fixed to compile, this patch fixes rr.
>>
>> Thanks for giving this a go.
>>
>> Having thought about this some more, I think Vince does make a good
>> point that throwing away samples is liable to break stuff, e.g. that
>> which only relies on (non-sensitive) samples.
>>
>> It still seems wrong to make up data, though.
>>
>> Maybe for exclude_kernel && !exclude_user events we can always generate
>> samples from the user regs, rather than the exception regs. That's going
>> to be closer to what the user wants, regardless. I'll take a look
>> tomorrow.
>
> I'm not very familiar with the kernel internals, but the reason I
> didn't suggest this originally is it seems like it will be difficult
> to determine what the "correct" userspace registers are. For example,
> what happens if a performance counter is fixed to a given tid, the
> interrupt fires during a context switch from that task to another that
> is not being monitored, and the kernel is far enough along in the
> context switch that the current task struct has been switched out?
> Reporting the new task's registers seems as bad as reporting the
> kernel's registers. But maybe this is easier than I imagine for
> whatever reason.
>
> Something to think about.
>
> - Kyle

We've noticed that the regressing changeset made it into stable
branches that distros are shipping now[0], and we're starting to
receive bug reports from our users. We would really like to get this
patch or something similar into 4.12 and the next stable releases if
at all possible.

Thanks!

- Kyle

[0] Full list in https://mail.mozilla.org/pipermail/rr-dev/2017-June/000500.html

2017-07-04 09:03:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

On Wed, Jun 28, 2017 at 03:55:07PM -0700, Kyle Huey wrote:

> > Having thought about this some more, I think Vince does make a good
> > point that throwing away samples is liable to break stuff, e.g. that
> > which only relies on (non-sensitive) samples.
> >
> > It still seems wrong to make up data, though.

It is something we do in other places as well though. For example the
printk() %pK thing fakes NULL pointers when kptr_restrict is set.

Faking data gets a wee bit tricky in how much data we need to clear
through, its not only IP, pretty much everything we get from the
interrupt context, like the branch stack and registers is also suspect.

> > Maybe for exclude_kernel && !exclude_user events we can always generate
> > samples from the user regs, rather than the exception regs. That's going
> > to be closer to what the user wants, regardless. I'll take a look
> > tomorrow.
>
> I'm not very familiar with the kernel internals, but the reason I
> didn't suggest this originally is it seems like it will be difficult
> to determine what the "correct" userspace registers are. For example,
> what happens if a performance counter is fixed to a given tid, the
> interrupt fires during a context switch from that task to another that
> is not being monitored, and the kernel is far enough along in the
> context switch that the current task struct has been switched out?
> Reporting the new task's registers seems as bad as reporting the
> kernel's registers. But maybe this is easier than I imagine for
> whatever reason.

If the counter is fixed to a task then its scheduled along with the
task. We'll schedule out the event before doing the actual task switch
and switch in the new event after.

That said, with a per-cpu event the TID sample value is indeed subject
to skid like you describe.

2017-07-04 09:07:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

On Thu, Jun 29, 2017 at 10:12:33AM +0200, Ingo Molnar wrote:
>
> * Mark Rutland <[email protected]> wrote:
>
> > It still seems wrong to make up data, though.
>
> So what we have here is a hardware quirk: we asked for user-space samples, but
> didn't get them and we cannot expose the kernel-internal address.
>
> The question is, how do we handle the hardware quirk. Since we cannot fix the
> hardware on existing systems there's really just two choices:
>
> - Lose the sample (and signal it as a lost sample)
>
> - Keep the sample but change the sensitive kernel-internal address to something
> that is not sensitive: 0 or -1 works, but we could perhaps also return a
> well-known user-space address such as the vDSO syscall trampoline or such?
>
> there's no other option really.
>
> I'd lean towards Vince's take: losing samples is more surprising than getting the
> occasional sample with some sanitized data in it.
>
> If we make the artificial data still a meaningful user-space address, related to
> kernel entries, then it might even be a bonus, as users would learn to recognize
> it as: 'oh, skid artifact, I know about that'.

So while we could easily fake SAMPLE_IP to do as you suggest, other
entries might be much harder to fake. That said, I have no problems with
just 0 stuffing them.

The only real problem is determining how much to stuff I suppose.

2017-07-04 09:34:43

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

On Tue, Jul 04, 2017 at 11:03:13AM +0200, Peter Zijlstra wrote:
> On Wed, Jun 28, 2017 at 03:55:07PM -0700, Kyle Huey wrote:
>
> > > Having thought about this some more, I think Vince does make a good
> > > point that throwing away samples is liable to break stuff, e.g. that
> > > which only relies on (non-sensitive) samples.
> > >
> > > It still seems wrong to make up data, though.
>
> It is something we do in other places as well though. For example the
> printk() %pK thing fakes NULL pointers when kptr_restrict is set.

It looks like I'm outnumbered on that, then. :)

I'd still argue it's worth somehow indicating which samples were thrown
away, so that (updated) userspace can choose to ignore them, but I guess
that can come later.

> Faking data gets a wee bit tricky in how much data we need to clear
> through, its not only IP, pretty much everything we get from the
> interrupt context, like the branch stack and registers is also suspect.

Indeed. I'll take a run through __perf_event_output() and callees, and
see what we need to drop.

> > > Maybe for exclude_kernel && !exclude_user events we can always generate
> > > samples from the user regs, rather than the exception regs. That's going
> > > to be closer to what the user wants, regardless. I'll take a look
> > > tomorrow.
> >
> > I'm not very familiar with the kernel internals, but the reason I
> > didn't suggest this originally is it seems like it will be difficult
> > to determine what the "correct" userspace registers are. For example,
> > what happens if a performance counter is fixed to a given tid, the
> > interrupt fires during a context switch from that task to another that
> > is not being monitored, and the kernel is far enough along in the
> > context switch that the current task struct has been switched out?
> > Reporting the new task's registers seems as bad as reporting the
> > kernel's registers. But maybe this is easier than I imagine for
> > whatever reason.
>
> If the counter is fixed to a task then its scheduled along with the
> task. We'll schedule out the event before doing the actual task switch
> and switch in the new event after.
>
> That said, with a per-cpu event the TID sample value is indeed subject
> to skid like you describe.

For per-cpu events, does that matter? Those don't have TID filters in
the first place, no?

Thanks,
Mark.

2017-07-04 09:46:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

On Tue, Jul 04, 2017 at 10:33:45AM +0100, Mark Rutland wrote:

> > That said, with a per-cpu event the TID sample value is indeed subject
> > to skid like you describe.
>
> For per-cpu events, does that matter? Those don't have TID filters in
> the first place, no?

eBPF can do all sorts I suppose.

But yes, at some point you just have to give up, there's only so much
one can do.

2017-07-04 10:04:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)


* Peter Zijlstra <[email protected]> wrote:

> On Thu, Jun 29, 2017 at 10:12:33AM +0200, Ingo Molnar wrote:
> >
> > * Mark Rutland <[email protected]> wrote:
> >
> > > It still seems wrong to make up data, though.
> >
> > So what we have here is a hardware quirk: we asked for user-space samples, but
> > didn't get them and we cannot expose the kernel-internal address.
> >
> > The question is, how do we handle the hardware quirk. Since we cannot fix the
> > hardware on existing systems there's really just two choices:
> >
> > - Lose the sample (and signal it as a lost sample)
> >
> > - Keep the sample but change the sensitive kernel-internal address to something
> > that is not sensitive: 0 or -1 works, but we could perhaps also return a
> > well-known user-space address such as the vDSO syscall trampoline or such?
> >
> > there's no other option really.
> >
> > I'd lean towards Vince's take: losing samples is more surprising than getting the
> > occasional sample with some sanitized data in it.
> >
> > If we make the artificial data still a meaningful user-space address, related to
> > kernel entries, then it might even be a bonus, as users would learn to recognize
> > it as: 'oh, skid artifact, I know about that'.
>
> So while we could easily fake SAMPLE_IP to do as you suggest, other
> entries might be much harder to fake. That said, I have no problems with
> just 0 stuffing them.
>
> The only real problem is determining how much to stuff I suppose.

I think the RIP is the most important one to fix up in an informative fashion
(instead of just zeroing it out), so that mainstream users of 'perf top' or
'perf report' have a chance to see that certain entries have this skid artifact.

The other registers should be zeroed out once we stop trusting a sample.

Thanks,

Ingo

2017-07-04 10:23:02

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

On Tue, Jul 04, 2017 at 10:33:45AM +0100, Mark Rutland wrote:
> On Tue, Jul 04, 2017 at 11:03:13AM +0200, Peter Zijlstra wrote:
> > Faking data gets a wee bit tricky in how much data we need to clear
> > through, its not only IP, pretty much everything we get from the
> > interrupt context, like the branch stack and registers is also suspect.
>
> Indeed. I'll take a run through __perf_event_output() and callees, and
> see what we need to drop.

Looking at perf_event_sample_format in uapi/linux/perf_event.h, there
are samples that are obviously sensitive, and should be dropped:

* PERF_SAMPLE_IP
* PERF_SAMPLE_CALLCHAIN
* PERF_SAMPLE_BRANCH_STACK
* PERF_SAMPLE_REGS_INTR

... samples that look benign:

* PERF_SAMPLE_TID
* PERF_SAMPLE_TIME
* PERF_SAMPLE_CPU
* PERF_SAMPLE_PERIOD
* PERF_SAMPLE_REGS_USER
* PERF_SAMPLE_STACK_USER
* PERF_SAMPLE_READ
* PERF_SAMPLE_ID
* PERF_SAMPLE_STREAM_ID
* PERF_SAMPLE_IDENTIFIER

.. and samples that I have no idea about:

* PERF_SAMPLE_ADDR
* PERF_SAMPLE_RAW
* PERF_SAMPLE_WEIGHT
* PERF_SAMPLE_DATA_SRC
* PERF_SAMPLE_TRANSACTION

Should any of those be moved into the "should be dropped" pile?

Thanks,
Mark.

2017-07-06 05:08:03

by Robert O'Callahan

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <[email protected]> wrote:
> Should any of those be moved into the "should be dropped" pile?

Why not be conservative and clear every sample you're not sure about?

We'd appreciate a fix sooner rather than later here, since rr is
currently broken on every stable Linux kernel and our attempts to
implement a workaround have failed.

(We have separate "interrupt" and "measure" counters, and I thought we
might work around this regression by programming the "interrupt"
counter to count kernel events as well as user events (interrupting
early is OK), but that caused our (completely separate) "measure"
counter to report off-by-one results (!), which seems to be a
different bug present on a range of older kernels.)

Thanks,
Rob
--
lbir ye,ea yer.tnietoehr rdn rdsme,anea lurpr edna e hnysnenh hhe uresyf toD
selthor stor edna siewaoeodm or v sstvr esBa kbvted,t rdsme,aoreseoouoto
o l euetiuruewFa kbn e hnystoivateweh uresyf tulsa rehr rdm or rnea lurpr
.a war hsrer holsa rodvted,t nenh hneireseoouot.tniesiewaoeivatewt sstvr esn

2017-07-11 02:03:38

by Kyle Huey

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

On Wed, Jul 5, 2017 at 10:07 PM, Robert O'Callahan <[email protected]> wrote:
> On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <[email protected]> wrote:
>> Should any of those be moved into the "should be dropped" pile?
>
> Why not be conservative and clear every sample you're not sure about?
>
> We'd appreciate a fix sooner rather than later here, since rr is
> currently broken on every stable Linux kernel and our attempts to
> implement a workaround have failed.
>
> (We have separate "interrupt" and "measure" counters, and I thought we
> might work around this regression by programming the "interrupt"
> counter to count kernel events as well as user events (interrupting
> early is OK), but that caused our (completely separate) "measure"
> counter to report off-by-one results (!), which seems to be a
> different bug present on a range of older kernels.)

This seems to have stalled out here unfortunately.

Can we get a consensus (from ingo or peterz?) on Mark's question? Or,
alternatively, can we move the patch at the top of this thread forward
on the stable branches until we do reach an answer to that question?

We've abandoned hope of working around this problem in rr and are
currently broken for all of our users with an up-to-date kernel, so
the situation for us is rather dire at the moment I'm afraid.

Thanks,

- Kyle

2017-07-11 09:04:05

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)


* Kyle Huey <[email protected]> wrote:

> On Wed, Jul 5, 2017 at 10:07 PM, Robert O'Callahan <[email protected]> wrote:
> > On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <[email protected]> wrote:
> >> Should any of those be moved into the "should be dropped" pile?
> >
> > Why not be conservative and clear every sample you're not sure about?
> >
> > We'd appreciate a fix sooner rather than later here, since rr is
> > currently broken on every stable Linux kernel and our attempts to
> > implement a workaround have failed.
> >
> > (We have separate "interrupt" and "measure" counters, and I thought we
> > might work around this regression by programming the "interrupt"
> > counter to count kernel events as well as user events (interrupting
> > early is OK), but that caused our (completely separate) "measure"
> > counter to report off-by-one results (!), which seems to be a
> > different bug present on a range of older kernels.)
>
> This seems to have stalled out here unfortunately.
>
> Can we get a consensus (from ingo or peterz?) on Mark's question? Or,
> alternatively, can we move the patch at the top of this thread forward
> on the stable branches until we do reach an answer to that question?
>
> We've abandoned hope of working around this problem in rr and are
> currently broken for all of our users with an up-to-date kernel, so
> the situation for us is rather dire at the moment I'm afraid.

Sorry about that - I've queued up a revert for the original commit and will send
the fix to Linus later today. I've added a -stable tag as well so it can be
forwarded to Greg the moment it hits upstream.

We should do the original fix as well, but in a version that does not skip the
sample but zeroes out the RIP and registers (or sets them all to -1LL) - and also
covers other possible places where skid-RIP is exposed, such as LBR.

Thanks,

Ingo

Subject: [tip:perf/urgent] Revert "perf/core: Drop kernel samples even though :u is specified"

Commit-ID: 6a8a75f3235724c5941a33e287b2f98966ad14c5
Gitweb: http://git.kernel.org/tip/6a8a75f3235724c5941a33e287b2f98966ad14c5
Author: Ingo Molnar <[email protected]>
AuthorDate: Tue, 11 Jul 2017 10:56:54 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 11 Jul 2017 10:56:54 +0200

Revert "perf/core: Drop kernel samples even though :u is specified"

This reverts commit cc1582c231ea041fbc68861dfaf957eaf902b829.

This commit introduced a regression that broke rr-project, which uses sampling
events to receive a signal on overflow (but does not care about the contents
of the sample). These signals are critical to the correct operation of rr.

There's been some back and forth about how to fix it - but to not keep
applications in limbo queue up a revert.

Reported-by: Kyle Huey <[email protected]>
Acked-by: Kyle Huey <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Jin Yao <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/20170628105600.GC5981@leverpostej
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/events/core.c | 21 ---------------------
1 file changed, 21 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4d2c32f..9747e42 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7308,21 +7308,6 @@ int perf_event_account_interrupt(struct perf_event *event)
return __perf_event_account_interrupt(event, 1);
}

-static bool sample_is_allowed(struct perf_event *event, struct pt_regs *regs)
-{
- /*
- * Due to interrupt latency (AKA "skid"), we may enter the
- * kernel before taking an overflow, even if the PMU is only
- * counting user events.
- * To avoid leaking information to userspace, we must always
- * reject kernel samples when exclude_kernel is set.
- */
- if (event->attr.exclude_kernel && !user_mode(regs))
- return false;
-
- return true;
-}
-
/*
* Generic event overflow handling, sampling.
*/
@@ -7344,12 +7329,6 @@ static int __perf_event_overflow(struct perf_event *event,
ret = __perf_event_account_interrupt(event, throttle);

/*
- * For security, drop the skid kernel samples if necessary.
- */
- if (!sample_is_allowed(event, regs))
- return ret;
-
- /*
* XXX event_limit might not quite work as expected on inherited
* events
*/

2017-07-11 13:07:35

by Jin Yao

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)



On 7/11/2017 5:03 PM, Ingo Molnar wrote:
> * Kyle Huey <[email protected]> wrote:
>
>> On Wed, Jul 5, 2017 at 10:07 PM, Robert O'Callahan <[email protected]> wrote:
>>> On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <[email protected]> wrote:
>>>> Should any of those be moved into the "should be dropped" pile?
>>> Why not be conservative and clear every sample you're not sure about?
>>>
>>> We'd appreciate a fix sooner rather than later here, since rr is
>>> currently broken on every stable Linux kernel and our attempts to
>>> implement a workaround have failed.
>>>
>>> (We have separate "interrupt" and "measure" counters, and I thought we
>>> might work around this regression by programming the "interrupt"
>>> counter to count kernel events as well as user events (interrupting
>>> early is OK), but that caused our (completely separate) "measure"
>>> counter to report off-by-one results (!), which seems to be a
>>> different bug present on a range of older kernels.)
>> This seems to have stalled out here unfortunately.
>>
>> Can we get a consensus (from ingo or peterz?) on Mark's question? Or,
>> alternatively, can we move the patch at the top of this thread forward
>> on the stable branches until we do reach an answer to that question?
>>
>> We've abandoned hope of working around this problem in rr and are
>> currently broken for all of our users with an up-to-date kernel, so
>> the situation for us is rather dire at the moment I'm afraid.
> Sorry about that - I've queued up a revert for the original commit and will send
> the fix to Linus later today. I've added a -stable tag as well so it can be
> forwarded to Greg the moment it hits upstream.
>
> We should do the original fix as well, but in a version that does not skip the
> sample but zeroes out the RIP and registers (or sets them all to -1LL) - and also
> covers other possible places where skid-RIP is exposed, such as LBR.
>
> Thanks,
>
> Ingo

Could we provide 2 options in user space when enabling the event sampling?

One option is for the use case like rr debugger which only cares the PMI
interrupt but doesn't care the skid. The skid samples doesn't need to be
dropped.

The other option is for the use case which needs the accurate sample
data. For this option, the skid samples are dropped.

I would suggest to let the user space make the decision to choose which
option.

Thanks
Jin Yao


2017-07-11 14:27:20

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

On Tue, Jul 11, 2017 at 11:03:58AM +0200, Ingo Molnar wrote:
>
> * Kyle Huey <[email protected]> wrote:
>
> > On Wed, Jul 5, 2017 at 10:07 PM, Robert O'Callahan <[email protected]> wrote:
> > > On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <[email protected]> wrote:
> > >> Should any of those be moved into the "should be dropped" pile?
> > >
> > > Why not be conservative and clear every sample you're not sure about?
> > >
> > > We'd appreciate a fix sooner rather than later here, since rr is
> > > currently broken on every stable Linux kernel and our attempts to
> > > implement a workaround have failed.
> > >
> > > (We have separate "interrupt" and "measure" counters, and I thought we
> > > might work around this regression by programming the "interrupt"
> > > counter to count kernel events as well as user events (interrupting
> > > early is OK), but that caused our (completely separate) "measure"
> > > counter to report off-by-one results (!), which seems to be a
> > > different bug present on a range of older kernels.)
> >
> > This seems to have stalled out here unfortunately.
> >
> > Can we get a consensus (from ingo or peterz?) on Mark's question? Or,
> > alternatively, can we move the patch at the top of this thread forward
> > on the stable branches until we do reach an answer to that question?
> >
> > We've abandoned hope of working around this problem in rr and are
> > currently broken for all of our users with an up-to-date kernel, so
> > the situation for us is rather dire at the moment I'm afraid.
>
> Sorry about that - I've queued up a revert for the original commit and will send
> the fix to Linus later today. I've added a -stable tag as well so it can be
> forwarded to Greg the moment it hits upstream.

Thanks for handling this.

Mark.

2017-07-11 15:32:40

by Kyle Huey

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

On Tue, Jul 11, 2017 at 2:03 AM, Ingo Molnar <[email protected]> wrote:
>
> * Kyle Huey <[email protected]> wrote:
>
>> On Wed, Jul 5, 2017 at 10:07 PM, Robert O'Callahan <[email protected]> wrote:
>> > On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <[email protected]> wrote:
>> >> Should any of those be moved into the "should be dropped" pile?
>> >
>> > Why not be conservative and clear every sample you're not sure about?
>> >
>> > We'd appreciate a fix sooner rather than later here, since rr is
>> > currently broken on every stable Linux kernel and our attempts to
>> > implement a workaround have failed.
>> >
>> > (We have separate "interrupt" and "measure" counters, and I thought we
>> > might work around this regression by programming the "interrupt"
>> > counter to count kernel events as well as user events (interrupting
>> > early is OK), but that caused our (completely separate) "measure"
>> > counter to report off-by-one results (!), which seems to be a
>> > different bug present on a range of older kernels.)
>>
>> This seems to have stalled out here unfortunately.
>>
>> Can we get a consensus (from ingo or peterz?) on Mark's question? Or,
>> alternatively, can we move the patch at the top of this thread forward
>> on the stable branches until we do reach an answer to that question?
>>
>> We've abandoned hope of working around this problem in rr and are
>> currently broken for all of our users with an up-to-date kernel, so
>> the situation for us is rather dire at the moment I'm afraid.
>
> Sorry about that - I've queued up a revert for the original commit and will send
> the fix to Linus later today. I've added a -stable tag as well so it can be
> forwarded to Greg the moment it hits upstream.

Great, thank you.

- Kyle

2017-07-12 07:57:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)


* Jin, Yao <[email protected]> wrote:

> Could we provide 2 options in user space when enabling the event sampling?
>
> One option is for the use case like rr debugger which only cares the PMI
> interrupt but doesn't care the skid. The skid samples doesn't need to be
> dropped.

Since it's an information leak, this is not something we want to expose as an
interface. It's also a very ugly interface: why not just *clear* the sample,
instead of dropping it?

The hardware messed up and gave us something we specifically did not permit
user-space to collect. We have to fix the bad effects to the best extent we can,
and not based on some knob.

Thanks,

Ingo

2017-07-18 00:07:30

by Kyle Huey

[permalink] [raw]
Subject: Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

On Tue, Jul 11, 2017 at 8:32 AM, Kyle Huey <[email protected]> wrote:
> On Tue, Jul 11, 2017 at 2:03 AM, Ingo Molnar <[email protected]> wrote:
>>
>> * Kyle Huey <[email protected]> wrote:
>>
>>> On Wed, Jul 5, 2017 at 10:07 PM, Robert O'Callahan <[email protected]> wrote:
>>> > On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland <[email protected]> wrote:
>>> >> Should any of those be moved into the "should be dropped" pile?
>>> >
>>> > Why not be conservative and clear every sample you're not sure about?
>>> >
>>> > We'd appreciate a fix sooner rather than later here, since rr is
>>> > currently broken on every stable Linux kernel and our attempts to
>>> > implement a workaround have failed.
>>> >
>>> > (We have separate "interrupt" and "measure" counters, and I thought we
>>> > might work around this regression by programming the "interrupt"
>>> > counter to count kernel events as well as user events (interrupting
>>> > early is OK), but that caused our (completely separate) "measure"
>>> > counter to report off-by-one results (!), which seems to be a
>>> > different bug present on a range of older kernels.)
>>>
>>> This seems to have stalled out here unfortunately.
>>>
>>> Can we get a consensus (from ingo or peterz?) on Mark's question? Or,
>>> alternatively, can we move the patch at the top of this thread forward
>>> on the stable branches until we do reach an answer to that question?
>>>
>>> We've abandoned hope of working around this problem in rr and are
>>> currently broken for all of our users with an up-to-date kernel, so
>>> the situation for us is rather dire at the moment I'm afraid.
>>
>> Sorry about that - I've queued up a revert for the original commit and will send
>> the fix to Linus later today. I've added a -stable tag as well so it can be
>> forwarded to Greg the moment it hits upstream.
>
> Great, thank you.
>
> - Kyle

Hi again,

I saw that the revert landed on perf/urgent but it doesn't look like
that got pulled by Linus in time for 4.13-rc1. Consider this a gentle
poke to please get this merged :)

- Kyle