From: Kan Liang <[email protected]>
perf stat -M metrics relies on weak groups to reject unschedulable
groups and run them as non-groups.
This uses the group validation code in the kernel. Unfortunately
that code doesn't take pinned events, such as the NMI watchdog, into
account. So some groups can pass validation, but then later still
never schedule.
For example,
$echo 1 > /proc/sys/kernel/nmi_watchdog
$perf stat -M Page_Walks_Utilization
Performance counter stats for 'system wide':
<not counted> itlb_misses.walk_pending
(0.00%)
<not counted> dtlb_load_misses.walk_pending
(0.00%)
<not counted> dtlb_store_misses.walk_pending
(0.00%)
<not counted> ept.walk_pending
(0.00%)
<not counted> cycles
(0.00%)
1.176613558 seconds time elapsed
Current pinned events are always scheduled first. So the new group must
can be scheduled together with current pinned events. Otherwise, it will
never get a chance to be scheduled later.
The trick is to pretend the current pinned events as part of the new
group, and insert them into the fake_cpuc.
The simulation result will tell if they can be scheduled successfully.
The fake_cpuc never touch event state. The current pinned events will
not be impacted.
Disabling interrupts to prevent the events in current CPU's cpuc going
away and getting freed.
It won't catch all possible cases that cannot be scheduled, such as
events pinned differently on different CPUs, or complicated constraints.
The validation is based on current environment. It doesn't help on the
case, which first create a group and then a pinned event, either.
But for the most common case, the NMI watchdog interacting with the
current perf metrics, it is strong enough.
After applying the patch,
$echo 1 > /proc/sys/kernel/nmi_watchdog
$ perf stat -M Page_Walks_Utilization
Performance counter stats for 'system wide':
2,491,910 itlb_misses.walk_pending # 0.0
Page_Walks_Utilization (79.94%)
13,630,942 dtlb_load_misses.walk_pending
(80.02%)
207,255 dtlb_store_misses.walk_pending
(80.04%)
0 ept.walk_pending
(80.04%)
236,204,924 cycles
(79.97%)
0.901785713 seconds time elapsed
Reported-by: Stephane Eranian <[email protected]>
Suggested-by: Andi Kleen <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
Changes since V3:
- Update comments (preemption is disabled as well)
The V2 still only check current CPU's cpuc. Because I think we cannot
prevent the cpuc in other CPU without a lock. Adding a lock will
introduce extra overhead in some critical path, e.g. context switch.
The patch is good enough for the common case. We may leave the other
complicated cases as they are.
Changes since V1:
- Disabling interrupts to prevent the events in current CPU's cpuc
going away and getting freed.
- Update comments and description
arch/x86/events/core.c | 34 +++++++++++++++++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 43d0918..c4d9e14 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2032,9 +2032,12 @@ static int validate_event(struct perf_event *event)
*/
static int validate_group(struct perf_event *event)
{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct perf_event *leader = event->group_leader;
struct cpu_hw_events *fake_cpuc;
- int ret = -EINVAL, n;
+ struct perf_event *pinned_event;
+ int ret = -EINVAL, n, i;
+ unsigned long flags;
fake_cpuc = allocate_fake_cpuc();
if (IS_ERR(fake_cpuc))
@@ -2054,9 +2057,38 @@ static int validate_group(struct perf_event *event)
if (n < 0)
goto out;
+ /*
+ * Disable interrupts and preemption to prevent the events in this
+ * CPU's cpuc going away and getting freed.
+ */
+ local_irq_save(flags);
+
+ /*
+ * The new group must can be scheduled together with current pinned
+ * events. Otherwise, it will never get a chance to be scheduled later.
+ *
+ * It won't catch all possible cases that cannot schedule, such as
+ * events pinned on CPU1, but the validation for a new CPU1 event
+ * running on other CPU. However, it's good enough to handle common
+ * cases like the global NMI watchdog.
+ */
+ for (i = 0; i < cpuc->n_events; i++) {
+ pinned_event = cpuc->event_list[i];
+ if (WARN_ON_ONCE(!pinned_event))
+ continue;
+ if (!pinned_event->attr.pinned)
+ continue;
+ fake_cpuc->n_events = n;
+ n = collect_events(fake_cpuc, pinned_event, false);
+ if (n < 0)
+ goto irq;
+ }
+
fake_cpuc->n_events = 0;
ret = x86_pmu.schedule_events(fake_cpuc, n, NULL);
+irq:
+ local_irq_restore(flags);
out:
free_fake_cpuc(fake_cpuc);
return ret;
--
2.7.4
On Thu, Jan 16, 2020 at 11:00:25AM -0800, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> perf stat -M metrics relies on weak groups to reject unschedulable
> groups and run them as non-groups.
> This uses the group validation code in the kernel. Unfortunately
> that code doesn't take pinned events, such as the NMI watchdog, into
> account. So some groups can pass validation, but then later still
> never schedule.
>
> For example,
>
> $echo 1 > /proc/sys/kernel/nmi_watchdog
> $perf stat -M Page_Walks_Utilization
>
> Performance counter stats for 'system wide':
>
> <not counted> itlb_misses.walk_pending
> (0.00%)
> <not counted> dtlb_load_misses.walk_pending
> (0.00%)
> <not counted> dtlb_store_misses.walk_pending
> (0.00%)
> <not counted> ept.walk_pending
> (0.00%)
> <not counted> cycles
> (0.00%)
>
> 1.176613558 seconds time elapsed
More unreadable mess.
On Thu, Jan 16, 2020 at 11:00:25AM -0800, [email protected] wrote:
> @@ -2054,9 +2057,38 @@ static int validate_group(struct perf_event *event)
> if (n < 0)
> goto out;
>
> + /*
> + * Disable interrupts and preemption to prevent the events in this
> + * CPU's cpuc going away and getting freed.
> + */
> + local_irq_save(flags);
> +
> + /*
> + * The new group must can be scheduled together with current pinned
> + * events. Otherwise, it will never get a chance to be scheduled later.
> + *
> + * It won't catch all possible cases that cannot schedule, such as
> + * events pinned on CPU1, but the validation for a new CPU1 event
> + * running on other CPU. However, it's good enough to handle common
> + * cases like the global NMI watchdog.
> + */
> + for (i = 0; i < cpuc->n_events; i++) {
> + pinned_event = cpuc->event_list[i];
> + if (WARN_ON_ONCE(!pinned_event))
> + continue;
> + if (!pinned_event->attr.pinned)
> + continue;
> + fake_cpuc->n_events = n;
> + n = collect_events(fake_cpuc, pinned_event, false);
> + if (n < 0)
> + goto irq;
> + }
> +
So I still completely hate this, because it makes the counter scheduling
more eratic.
It changes a situation where we only have false-positives (we allow
scheduling a group that might not ever get to run) into a situation
where we can have both false-positives and false-negatives.
Imagine the pinned event is for a currently running task; and that task
only runs sporadically. Then you can sometimes not create the group, but
mostly it'll work.
Yes, this is all very annoying, but I really don't see how this makes
anything any better.
> So I still completely hate this, because it makes the counter scheduling
> more eratic.
>
> It changes a situation where we only have false-positives (we allow
> scheduling a group that might not ever get to run) into a situation
> where we can have both false-positives and false-negatives.
>
> Imagine the pinned event is for a currently running task; and that task
> only runs sporadically. Then you can sometimes not create the group, but
> mostly it'll work.
Right now we have real situations which always fail because of this.
>
> Yes, this is all very annoying, but I really don't see how this makes
> anything any better.
The problem this is trying to solve is that some -M metrics fail
systematically with the NMI watchdog on. Metrics use weak groups
to avoid needing to have the full knowledge how events
can be scheduled in the user tools. So they ely on weak groups working.
Some of the JSON metrics have groups which always validate, but never
schedule with the NMI watchdog on.
If you have a better proposal to solve this problem please share
it.
I suppose we could use export at least the number of available counters
in sysfs and then split the groups in the user tools (and assume
that's good enough and full counter constraints are not needed)
I have some older patches to export the number at least. But fixing the
group validation seems better.
-Andi