2023-11-20 22:32:32

by Ashley, William

Subject: Re: armv8pmu: Pending overflow interrupt is discarded when perf event is disabled

Adding [email protected] and [email protected],
sorry for the noise.


On 11/20/23, 12:36 PM, "Ashley, William" <[email protected]> wrote:


An issue [1] was opened in the rr-debugger project reporting occasional missed
perf event overflow signals on arm64. I've been digging into this and think I
understand what's happening, but wanted to confirm my understanding.


The attached example application, derived from an rr-debugger test case, reports
when the value of a counter doesn't increase by the expected period +/- some
tolerance. When the process is ping-ponged between cores (e.g. with taskset) at
a high frequency, it frequently reports increases of ~2x the expected period.
I've confirmed this same behavior on kernels 5.4, 5.10, 6.1 and 6.5.


I found armv8pmu_disable_intens [2], which is called as part of event
de-scheduling and contains

	/* Clear the overflow flag in case an interrupt is pending. */
	write_pmovsclr(mask);

which results in any pending overflow interrupt being dropped. I added some
debug output here and there is indeed a correlation between this bit being set
at the point of the above code and the reproducer identifying a missed signal.


This behavior does not occur with pseudo-NMIs (irqchip.gicv3_pseudo_nmi=1)
enabled.


When an event is not being explicitly torn down (e.g. being closed), this seems
like undesirable behavior. I haven't attempted to demonstrate it yet, but I
suspect an application disabling an event temporarily could occasionally see
the same missed overflow signals. Is my understanding here correct? Does anyone
have thoughts on how this could be addressed without creating other issues?


[1] https://github.com/rr-debugger/rr/issues/3607
[2] https://github.com/torvalds/linux/blob/c42d9eeef8e5ba9292eda36fd8e3c11f35ee065c/drivers/perf/arm_pmuv3.c#L652C20-L652C43








2023-11-29 16:35:51

by Mark Rutland

Subject: Re: armv8pmu: Pending overflow interrupt is discarded when perf event is disabled

Hi William,

On Mon, Nov 20, 2023 at 10:32:10PM +0000, Ashley, William wrote:
> Adding [email protected] and [email protected],
> sorry for the noise.

Thanks for this!

For the benefit of others, the original mail (and attachment) can be found at:

https://lore.kernel.org/linux-perf-users/[email protected]/

For some reason, links and whitespace seem to have been mangled in the resend;
I'm not sure what happened there.

I've added Will Deacon, as Will and I co-maintain the ARM PMU drivers.

> On 11/20/23, 12:36 PM, "Ashley, William" <[email protected]> wrote:
>
>
> An issue [1] was opened in the rr-debugger project reporting occasional missed
> perf event overflow signals on arm64. I've been digging into this and think I
> understand what's happening, but wanted to confirm my understanding.
>
> The attached example application, derived from an rr-debugger test case, reports
> when the value of a counter doesn't increase by the expected period +/- some
> tolerance. When it is ping-ponged between cores (e.g. with taskset) at a high
> frequency, it frequently reports increases of ~2x the expected. I've confirmed
> this same behavior on kernels 5.4, 5.10, 6.1 and 6.5.
>
>
> I found armv8pmu_disable_intens [2] that is called as part of event
> de-scheduling and contains
> /* Clear the overflow flag in case an interrupt is pending. */
> write_pmovsclr(mask);
> which results in any pending overflow interrupt being dropped. I added some
> debug output here and indeed there is a correlation of this bit being high at
> the point of the above code and the reproducer identifying a missed signal.

I think you're right that if we had an overflow asserted at this point, we'll
throw away the occurrence of the overflow (and so not call
perf_event_overflow() and generate a sample, etc).

It looks like we only lose the occurrence of the overflow; the actual counts
will get sampled correctly and when we next reprogram the event,
armpmu_event_set_period() should set up the next overflow period.

> This behavior does not occur with pseudo-NMIs (irqchip.gicv3_pseudo_nmi=1)
> enabled.

That's interesting, because it implies that the PMU overflow interrupt is being
recognised by the CPU while regular interrupts are disabled. There are some
narrow races where that could occur (e.g. taking a timer or scheduler IRQ
*just* as an overflow occurs), and some other cases I'd expect RR to avoid by
construction (e.g. if RR isn't using mode exclusion and also counts kernel
events). It's also worth noting that this means there are races even with
pseudo-NMI where overflows could be lost.

How often do you see the overflow being lost?

Does RR set any of the perf_event_attr::exclude_* bits? If not, does RR
intentionally count events that occur within the kernel?

> When an event is not being explicitly torn down (e.g. being closed), this seems
> like an undesirable behavior.

I agree it's undesirable, though my understanding is that for most other users
this isn't any worse than losing samples for other reasons (e.g. the perf ring
buffer being full), and for the general case of sample-based profiling, losing
a sample every so often isn't the end of the world.

> I haven't attempted to demo it yet, but I suspect
> an application disabling an event temporarily could occasionally see the same
> missed overflow signals. Is my understanding here correct?

That sounds right to me, though I haven't checked that end-to-end yet.

> Does anyone have thoughts on how this could be addressed without creating
> other issues?

We should be able to detect overflow from the counter value alone, so we might
be able to account for that when we actually read the event out, or when we
schedule it back in and reprogram the period.

I'm not sure if we can reasonably do that when scheduling the event out, since
if we're switching tasks, that'll queue up IRQ work which will be triggered in
the context of the next task.

We might be able to figure out that we have an overflow when we schedule the
event in under armpmu_start(), but I'll need to go digging to see if there are
any free-running counters as the comment above the call to
armpmu_event_set_period() says, or whether that's a historical artifact.

I suspect we might need to restructure the code somewhat to be able to catch
overflows more consistently. I'll see if I can come up with something, but we
might not be able to guarantee this in all cases.

Mark.

> [1] https://github.com/rr-debugger/rr/issues/3607
> [2] https://github.com/torvalds/linux/blob/c42d9eeef8e5ba9292eda36fd8e3c11f35ee065c/drivers/perf/arm_pmuv3.c#L652C20-L652C43

2023-11-30 11:49:20

by Mark Rutland

Subject: Re: armv8pmu: Pending overflow interrupt is discarded when perf event is disabled

On Wed, Nov 29, 2023 at 04:35:01PM +0000, Mark Rutland wrote:
> Hi William,
>
> On Mon, Nov 20, 2023 at 10:32:10PM +0000, Ashley, William wrote:
> > Adding [email protected] and [email protected],
> > sorry for the noise.
>
> Thanks for this!
>
> For the benefit of others, the original mail (and attachment) can be found at:
>
> https://lore.kernel.org/linux-perf-users/[email protected]/
>
> For some reason, links and whitespace seem to have been mangled in the resend;
> I'm not sure what happened there.
>
> I've added Will Deacon, as Will and I co-maintain the ARM PMU drivers.
>
> > On 11/20/23, 12:36 PM, "Ashley, William" <[email protected]> wrote:
> >
> >
> > An issue [1] was opened in the rr-debugger project reporting occasional missed
> > perf event overflow signals on arm64. I've been digging into this and think I
> > understand what's happening, but wanted to confirm my understanding.
> >
> > The attached example application, derived from an rr-debugger test case, reports
> > when the value of a counter doesn't increase by the expected period +/- some
> > tolerance. When it is ping-ponged between cores (e.g. with taskset) at a high
> > frequency, it frequently reports increases of ~2x the expected. I've confirmed
> > this same behavior on kernels 5.4, 5.10, 6.1 and 6.5.
> >
> >
> > I found armv8pmu_disable_intens [2] that is called as part of event
> > de-scheduling and contains
> > /* Clear the overflow flag in case an interrupt is pending. */
> > write_pmovsclr(mask);
> > which results in any pending overflow interrupt being dropped. I added some
> > debug output here and indeed there is a correlation of this bit being high at
> > the point of the above code and the reproducer identifying a missed signal.
>
> I think you're right that if we had an overflow asserted at this point, we'll
> throw away the occurrence of the overflow (and so not call
> perf_event_overflow() and generate a sample, etc).
>
> It looks like we only lose the occurrence of the overflow; the actual counts
> will get sampled correctly and when we next reprogram the event,
> armpmu_event_set_period() should set up the next overflow period.
>
> > This behavior does not occur with pseudo-NMIs (irqchip.gicv3_pseudo_nmi=1)
> > enabled.
>
> That's interesting, because it implies that the PMU overflow interrupt is being
> recognised by the CPU while regular interrupts are disabled. There are some
> narrow races where that could occur (e.g. taking a timer or scheduler IRQ
> *just* as an overflow occurs), and some other cases I'd expect RR to avoid by
> construction (e.g. if RR isn't using mode exclusion and also counts kernel
> events). It's also worth noting that this means there are races even with
> pseudo-NMI where overflows could be lost.
>
> How often do you see the overflow being lost?
>
> Does RR set any of the perf_event_attr::exclude_* bits? If not, does RR
> intentionally count events that occur within the kernel?

Looking at the test, I see it sets perf_event_attr::exclude_kernel to 1, but
doesn't set perf_event_attr::exclude_host or perf_event_attr::exclude_hv. I
think the poorly-defined exclude_* bits are part of the problem here.

Using your test as-is on my ThunderX2, I can reproduce the period being longer
than expected by concurrently running the following in a shell:

while true; do
    for C in $(seq 0 63); do
        taskset -c -p $C ${TEST_PID_HERE};
    done;
done > /dev/null

... resulting in:

| [mark@gravadlaks:~]% ./counter-overflow
| Pid 20060 running with period 10000 tolerance 1000
| Signal #1: last: 0, cur: 10292, max diff: 0
| Signal #415330: delta of 19999 is outside 10000 +/- 1000
| Signal #415330: last: 4153290187, cur: 4153310186, max diff: 10292
| Signal #489879: delta of 19998 is outside 10000 +/- 1000
| Signal #511842: delta of 20058 is outside 10000 +/- 1000
| Signal #511842: last: 5118430130, cur: 5118450188, max diff: 19999

However, if I modify the test to also set perf_event_attr::exclude_host=1, I do
not see any lost overflows after many minutes. On VHE hosts (like the
ThunderX2), the host kernel gets counted when perf_event_attr::exclude_host=0,
even if perf_event_attr::exclude_kernel=1 (which I agree is surprising), so I
think what's happening is the counters are counting in the host kernel, which
isn't what RR actually wants regardless.

From looking at:

https://github.com/rr-debugger/rr/blob/master/src/PerfCounters.cc

... it looks like RR only sets perf_event_attr::exclude_{kernel,guest}, and due
to this I think it's still counting the host kernel (and potentially the
hypervisor, depending on that hypervisor's policy for what it filters from the
guest).

There's still a potential race on the kernel side, but I suspect that if RR
were to set the non-user exclude_* bits it could significantly reduce the
chance of a lost overflow signal on existing kernels, at least on bare-metal.

This is an annoying ABI; the interaction between all the bits is ill-defined,
especially host/hv/kernel (since for VHE those are the same thing). Ideally
this would have had the opposite polarity and you'd have only needed to set an
include_user bit and be done with it.

I'll continue to look at what we can do kernel-side, but I reckon it's worth
having RR try the other exclude bits regardless, if that's possible? It would
be interesting to know whether that helps you under a hypervisor.

Thanks,
Mark.

> > When an event is not being explicitly torn down (e.g. being closed), this seems
> > like an undesirable behavior.
>
> I agree it's undesirable, though my understanding is that for most other users
> this isn't any worse than losing samples for other reasons (e.g. the perf ring
> buffer being full), and for the general case of sample-based profiling, losing
> a sample every so often isn't the end of the world.
>
> > I haven't attempted to demo it yet, but I suspect
> > an application disabling an event temporarily could occasionally see the same
> > missed overflow signals. Is my understanding here correct?
>
> That sounds right to me, though I haven't checked that end-to-end yet.
>
> > Does anyone have thoughts on how this could be addressed without creating
> > other issues?
>
> We should be able to detect overflow from the counter value alone, so we might
> be able to account for that when we actually read the event out, or when we
> schedule it back in and reprogram the period.
>
> I'm not sure if we can reasonably do that when scheduling the event out, since
> if we're switching tasks, that'll queue up IRQ work which will be triggered in
> the context of the next task.
>
> We might be able to figure out that we have an overflow when we schedule the
> event in under armpmu_start(), but I'll need to go digging to see if there are
> any free-running counters as the comment above the call to
> armpmu_event_set_period() says, or whether that's a historical artifact.
>
> I suspect we might need to restructure the code somewhat to be able to catch
> overflows more consistently. I'll see if I can come up with something, but we
> might not be able to guarantee this in all cases.
>
> Mark.
>
> > [1] https://github.com/rr-debugger/rr/issues/3607
> > [2] https://github.com/torvalds/linux/blob/c42d9eeef8e5ba9292eda36fd8e3c11f35ee065c/drivers/perf/arm_pmuv3.c#L652C20-L652C43

2023-11-30 12:23:36

by Mark Rutland

Subject: Re: armv8pmu: Pending overflow interrupt is discarded when perf event is disabled

On Thu, Nov 30, 2023 at 11:48:54AM +0000, Mark Rutland wrote:
> On Wed, Nov 29, 2023 at 04:35:01PM +0000, Mark Rutland wrote:
> > Does RR set any of the perf_event_attr::exclude_* bits? If not, does RR
> > intentionally count events that occur within the kernel?
>
> Looking at the test, I see it sets perf_event_attr::exclude_kernel to 1, but
> doesn't set perf_event_attr::exclude_host or perf_event_attr::exclude_hv. I
> think the poorly-defined exclude_* bits are part of the problem here.
>
> Using your test as-is on my ThunderX2, I can reproduce the period being longer
> than expected by concurrently running the following in a shell:
>
> while true; do
>     for C in $(seq 0 63); do
>         taskset -c -p $C ${TEST_PID_HERE};
>     done;
> done > /dev/null
>
> ... resulting in:
>
> | [mark@gravadlaks:~]% ./counter-overflow
> | Pid 20060 running with period 10000 tolerance 1000
> | Signal #1: last: 0, cur: 10292, max diff: 0
> | Signal #415330: delta of 19999 is outside 10000 +/- 1000
> | Signal #415330: last: 4153290187, cur: 4153310186, max diff: 10292
> | Signal #489879: delta of 19998 is outside 10000 +/- 1000
> | Signal #511842: delta of 20058 is outside 10000 +/- 1000
> | Signal #511842: last: 5118430130, cur: 5118450188, max diff: 19999
>
> However, if I modify the test to also set perf_event_attr::exclude_host=1, I do
> not see any lost overflows after many minutes. On VHE hosts (like the
> ThunderX2), the host kernel gets counted when perf_event_attr::exclude_host=0,
> even if perf_event_attr::exclude_kernel=1 (which I agree is surprising), so I
> think what's happening is the counters are counting in the host kernel, which
> isn't what RR actually wants regardless.

> I'll continue to look at what we can do kernel-side, but I reckon it's worth
> having RR try the other exclude bits regardless, if that's possible? It would
> be interesting to know whether that helps you under a hypervisor.

Sorry, the above is wrong, and I do not recommend RR goes and changes its
exclude_* settings.

I had misread the logic in armv8pmu_set_event_filter(), but looking again
that's saner than I thought it was, and what was actually happening in my
testing is that exclude_host also filtered host EL0 (userspace), and so the
test received *no* overflow signals.

I'll get back to looking at how we can better capture the overflow when
removing an event.

Mark.

2023-12-05 18:16:02

by Mark Rutland

Subject: Re: armv8pmu: Pending overflow interrupt is discarded when perf event is disabled

On Thu, Nov 30, 2023 at 12:23:13PM +0000, Mark Rutland wrote:
> On Thu, Nov 30, 2023 at 11:48:54AM +0000, Mark Rutland wrote:
> > On Wed, Nov 29, 2023 at 04:35:01PM +0000, Mark Rutland wrote:
> > > Does RR set any of the perf_event_attr::exclude_* bits? If not, does RR
> > > intentionally count events that occur within the kernel?
> >
> > Looking at the test, I see it sets perf_event_attr::exclude_kernel to 1, but
> > doesn't set perf_event_attr::exclude_host or perf_event_attr::exclude_hv. I
> > think the poorly-defined exclude_* bits are part of the problem here.
> >
> > Using your test as-is on my ThunderX2, I can reproduce the period being longer
> > than expected by concurrently running the following in a shell:
> >
> > while true; do
> >     for C in $(seq 0 63); do
> >         taskset -c -p $C ${TEST_PID_HERE};
> >     done;
> > done > /dev/null
> >
> > ... resulting in:
> >
> > | [mark@gravadlaks:~]% ./counter-overflow
> > | Pid 20060 running with period 10000 tolerance 1000
> > | Signal #1: last: 0, cur: 10292, max diff: 0
> > | Signal #415330: delta of 19999 is outside 10000 +/- 1000
> > | Signal #415330: last: 4153290187, cur: 4153310186, max diff: 10292
> > | Signal #489879: delta of 19998 is outside 10000 +/- 1000
> > | Signal #511842: delta of 20058 is outside 10000 +/- 1000
> > | Signal #511842: last: 5118430130, cur: 5118450188, max diff: 19999
> >
> > However, if I modify the test to also set perf_event_attr::exclude_host=1, I do
> > not see any lost overflows after many minutes. On VHE hosts (like the
> > ThunderX2), the host kernel gets counted when perf_event_attr::exclude_host=0,
> > even if perf_event_attr::exclude_kernel=1 (which I agree is surprising), so I
> > think what's happening is the counters are counting in the host kernel, which
> > isn't what RR actually wants regardless.
>
> > I'll continue to look at what we can do kernel-side, but I reckon it's worth
> > having RR try the other exclude bits regardless, if that's possible? It would
> > be interesting to know whether that helps you under a hypervisor.
>
> Sorry, the above is wrong, and I do not recommend RR goes and changes its
> exclude_* settings.
>
> I had misread the logic in armv8pmu_set_event_filter(), but looking again
> that's saner than I thought it was, and what was actually happening in my
> testing is that exclude_host also filtered host EL0 (userspace), and so the
> test received *no* overflow signals.
>
> I'll get back to looking at how we can better capture the overflow when
> removing an event.

I've spent a couple of days on this, and I have a prototype at:

https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=perf/overflow

The gist of that is:

* It's not viable to sample at start/stop time since we don't have all the
  usual sample state (e.g. pt_regs), which means that we have to save/restore
  the overflow status and rely on the regular interrupt handler to actually
  sample after this is restored.

* Saving/restoring the overflow status requires reworking the way we manage the
  sample period, separating the sample period from the period we program into
  HW. This part looks fairly sane, but requires some rework of all the arm_pmu
  drivers.

* We might need a couple of small changes to the perf core code (e.g. for the
  IOCTLs that mess with the period), and this change is liable to affect other
  architectures, so we'd need to go audit that.

I reckon it's possible (with some testing and review) to get something like
this series merged, but I suspect it's too big for a stable backport.

Mark.

2023-12-05 18:34:05

by Ashley, William

Subject: Re: armv8pmu: Pending overflow interrupt is discarded when perf event is disabled

Mark,

Thanks for your time on this. Is there (any more of) a chance of getting a
spurious second overflow signal with these changes, for example when the
counter is counting EL1, or when pseudo-NMI is enabled?


On 12/5/23, 12:15 PM, "Mark Rutland" <[email protected]> wrote:


On Thu, Nov 30, 2023 at 12:23:13PM +0000, Mark Rutland wrote:
> On Thu, Nov 30, 2023 at 11:48:54AM +0000, Mark Rutland wrote:
> > On Wed, Nov 29, 2023 at 04:35:01PM +0000, Mark Rutland wrote:
> > > Does RR set any of the perf_event_attr::exclude_* bits? If not, does RR
> > > intentionally count events that occur within the kernel?
> >
> > Looking at the test, I see it sets perf_event_attr::exclude_kernel to 1, but
> > doesn't set perf_event_attr::exclude_host or perf_event_attr::exclude_hv. I
> > think the poorly-defined exclude_* bits are part of the problem here.
> >
> > Using your test as-is on my ThunderX2, I can reproduce the period being longer
> > than expected by concurrently running the following in a shell:
> >
> > while true; do
> >     for C in $(seq 0 63); do
> >         taskset -c -p $C ${TEST_PID_HERE};
> >     done;
> > done > /dev/null
> >
> > ... resulting in:
> >
> > | [mark@gravadlaks:~]% ./counter-overflow
> > | Pid 20060 running with period 10000 tolerance 1000
> > | Signal #1: last: 0, cur: 10292, max diff: 0
> > | Signal #415330: delta of 19999 is outside 10000 +/- 1000
> > | Signal #415330: last: 4153290187, cur: 4153310186, max diff: 10292
> > | Signal #489879: delta of 19998 is outside 10000 +/- 1000
> > | Signal #511842: delta of 20058 is outside 10000 +/- 1000
> > | Signal #511842: last: 5118430130, cur: 5118450188, max diff: 19999
> >
> > However, if I modify the test to also set perf_event_attr::exclude_host=1, I do
> > not see any lost overflows after many minutes. On VHE hosts (like the
> > ThunderX2), the host kernel gets counted when perf_event_attr::exclude_host=0,
> > even if perf_event_attr::exclude_kernel=1 (which I agree is surprising), so I
> > think what's happening is the counters are counting in the host kernel, which
> > isn't what RR actually wants regardless.
>
> > I'll continue to look at what we can do kernel-side, but I reckon it's worth
> > having RR try the other exclude bits regardless, if that's possible? It would
> > be interesting to know whether that helps you under a hypervisor.
>
> Sorry, the above is wrong, and I do not recommend RR goes and changes its
> exclude_* settings.
>
> I had misread the logic in armv8pmu_set_event_filter(), but looking again
> that's saner than I thought it was, and what was actually happening in my
> testing is that exclude_host also filtered host EL0 (userspace), and so the
> test received *no* overflow signals.
>
> I'll get back to looking at how we can better capture the overflow when
> > removing an event.

I've spent a couple of days on this, and I have a prototype at:

https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=perf/overflow

The gist of that is:

* It's not viable to sample at start/stop time since we don't have all the
  usual sample state (e.g. pt_regs), which means that we have to save/restore
  the overflow status and rely on the regular interrupt handler to actually
  sample after this is restored.

* Saving/restoring the overflow status requires reworking the way we manage the
  sample period, separating the sample period from the period we program into
  HW. This part looks fairly sane, but requires some rework of all the arm_pmu
  drivers.

* We might need a couple of small changes to the perf core code (e.g. for the
  IOCTLs that mess with the period), and this change is liable to affect other
  architectures, so we'd need to go audit that.

I reckon it's possible (with some testing and review) to get something like
this series merged, but I suspect it's too big for a stable backport.

Mark.