2018-04-16 22:06:51

by Stephane Eranian

[permalink] [raw]
Subject: [RFC] perf/core: what is exclude_idle supposed to do

Hi,

I am trying to understand what the exclude_idle event attribute is supposed
to accomplish.
As per the definition in the header file:

exclude_idle : 1, /* don't count when idle */

Naively, I thought it would simply stop the event when running in the
context of the idle task (swapper, pid 0) on any CPU. That would seem to
match the succinct description.

However, running a simple:

$ perf record -a -e cycles:I sleep 5
perf_event_attr:
sample_type IP|TID|TIME|CPU|PERIOD
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID
exclude_idle 1

on an idle machine, showed me that this is not what is actually happening:
$ perf script
swapper 0 [000] 1132634.287442: 1 cycles:I:
ffffffff8100b1fb __intel_pmu_enable_all.isra.17 ([kernel.kallsyms])
swapper 0 [001] 1132634.287498: 1 cycles:I:
ffffffff8100b1fb __intel_pmu_enable_all.isra.17 ([kernel.kallsyms])


After looking at the code, it all made sense, it seems to current
implementation is only relevant for sw events. I don't understand why.

I am left wondering what is the goal of exclude_idle?


2018-04-17 06:21:41

by Jiri Olsa

[permalink] [raw]
Subject: Re: [RFC] perf/core: what is exclude_idle supposed to do

On Mon, Apr 16, 2018 at 10:04:53PM +0000, Stephane Eranian wrote:
> Hi,
>
> I am trying to understand what the exclude_idle event attribute is supposed
> to accomplish.
> As per the definition in the header file:
>
> exclude_idle : 1, /* don't count when idle */
>
> Naively, I thought it would simply stop the event when running in the
> context of the idle task (swapper, pid 0) on any CPU. That would seem to
> match the succinct description.
>
> However, running a simple:
>
> $ perf record -a -e cycles:I sleep 5
> perf_event_attr:
> sample_type IP|TID|TIME|CPU|PERIOD
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID
> exclude_idle 1
>
> on an idle machine, showed me that this is not what is actually happening:
> $ perf script
> swapper 0 [000] 1132634.287442: 1 cycles:I:
> ffffffff8100b1fb __intel_pmu_enable_all.isra.17 ([kernel.kallsyms])
> swapper 0 [001] 1132634.287498: 1 cycles:I:
> ffffffff8100b1fb __intel_pmu_enable_all.isra.17 ([kernel.kallsyms])
>
>
> After looking at the code, it all made sense, it seems to current
> implementation is only relevant for sw events. I don't understand why.

AFAICS it's not implemented

Peter suggested some time ago change for cpu-clock (attached)
I still have it in my queue, because it gives funny numbers
sometimes.. and I haven't figured it out so far

the idea so far was to use cpu-clock,cpu-clock:I events to
count and display idle % in perf top/record and stat metrics

jirka


---
From 7f20b047cf56f8086d79007175592633798656ce Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <[email protected]>
Date: Thu, 23 Nov 2017 16:15:36 +0100
Subject: [PATCH] perf: Add support to monitor idle time on cpu-clock

Adding support to use perf_event_attr::exclude_idle
(the :I modifierr in perf tools)to monitor idle time
on cpu-clock event.

Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Jiri Olsa <[email protected]>
---
kernel/events/core.c | 22 ++++++++++++++++++----
kernel/sched/idle_task.c | 17 +++++++++++++++++
2 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2989e3b0fe8b..62c0dc75ed11 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9391,19 +9391,33 @@ static void perf_swevent_init_hrtimer(struct perf_event *event)
* Software event: cpu wall time clock
*/

+static u64 cpu_clock_count(struct perf_event *event)
+{
+ u64 now = local_clock();
+
+ if (event->attr.exclude_idle)
+ now -= idle_task(event->oncpu)->se.sum_exec_runtime;
+
+ return now;
+}
+
static void cpu_clock_event_update(struct perf_event *event)
{
- s64 prev;
+ s64 prev, delta;
u64 now;

- now = local_clock();
+ now = cpu_clock_count(event);
prev = local64_xchg(&event->hw.prev_count, now);
- local64_add(now - prev, &event->count);
+ delta = now - prev;
+ if (delta > 0)
+ local64_add(delta, &event->count);
}

static void cpu_clock_event_start(struct perf_event *event, int flags)
{
- local64_set(&event->hw.prev_count, local_clock());
+ u64 now = cpu_clock_count(event);
+
+ local64_set(&event->hw.prev_count, now);
perf_swevent_start_hrtimer(event);
}

diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index d518664cce4f..e360ce5dd9a2 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -27,9 +27,14 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
static struct task_struct *
pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
+ struct task_struct *idle = rq->idle;
+
put_prev_task(rq, prev);
update_idle_core(rq);
schedstat_inc(rq->sched_goidle);
+
+ idle->se.exec_start = rq_clock_task(rq);
+
return rq->idle;
}

@@ -48,6 +53,15 @@ dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags)

static void put_prev_task_idle(struct rq *rq, struct task_struct *prev)
{
+ struct task_struct *idle = rq->idle;
+ u64 delta, now;
+
+ now = rq_clock_task(rq);
+ delta = now - idle->se.exec_start;
+ if (likely((s64)delta > 0))
+ idle->se.sum_exec_runtime += delta;
+ idle->se.exec_start = now;
+
rq_last_tick_reset(rq);
}

@@ -57,6 +71,9 @@ static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued)

static void set_curr_task_idle(struct rq *rq)
{
+ struct task_struct *idle = rq->idle;
+
+ idle->se.exec_start = rq_clock_task(rq);
}

static void switched_to_idle(struct rq *rq, struct task_struct *p)
--
2.13.6


2018-04-17 13:43:02

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [RFC] perf/core: what is exclude_idle supposed to do

Em Mon, Apr 16, 2018 at 10:04:53PM +0000, Stephane Eranian escreveu:
> Hi,
>
> I am trying to understand what the exclude_idle event attribute is supposed
> to accomplish.
> As per the definition in the header file:
>
> exclude_idle : 1, /* don't count when idle */
>
> Naively, I thought it would simply stop the event when running in the
> context of the idle task (swapper, pid 0) on any CPU. That would seem to
> match the succinct description.
>
> However, running a simple:
>
> $ perf record -a -e cycles:I sleep 5
> perf_event_attr:
> sample_type IP|TID|TIME|CPU|PERIOD
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID
> exclude_idle 1
>
> on an idle machine, showed me that this is not what is actually happening:
> $ perf script
> swapper 0 [000] 1132634.287442: 1 cycles:I:
> ffffffff8100b1fb __intel_pmu_enable_all.isra.17 ([kernel.kallsyms])
> swapper 0 [001] 1132634.287498: 1 cycles:I:
> ffffffff8100b1fb __intel_pmu_enable_all.isra.17 ([kernel.kallsyms])
>
>
> After looking at the code, it all made sense, it seems to current
> implementation is only relevant for sw events. I don't understand why.
>
> I am left wondering what is the goal of exclude_idle?

To do something it is not doing, i.e. to do what you expected it to do.

There were messages exchanged recently where PeterZ said that this is
just that needs fixing.

- Arnaldo

2018-04-18 15:12:03

by Vince Weaver

[permalink] [raw]
Subject: Re: [RFC] perf/core: what is exclude_idle supposed to do

On Tue, 17 Apr 2018, Jiri Olsa wrote:

> On Mon, Apr 16, 2018 at 10:04:53PM +0000, Stephane Eranian wrote:
> > Hi,
> >
> > I am trying to understand what the exclude_idle event attribute is supposed
> > to accomplish.
> > As per the definition in the header file:
> >
> > exclude_idle : 1, /* don't count when idle */
>
> AFAICS it's not implemented

so just to be completely clear hear, we're saying that the "exclude_idle"
modifier has never done anything useful and still doesn't?

If so I should update the perf_event_open manpage to spell this out.

Vince

2018-04-20 08:36:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] perf/core: what is exclude_idle supposed to do

On Mon, Apr 16, 2018 at 10:04:53PM +0000, Stephane Eranian wrote:
> Hi,
>
> I am trying to understand what the exclude_idle event attribute is supposed
> to accomplish.
> As per the definition in the header file:
>
> exclude_idle : 1, /* don't count when idle */
>
> Naively, I thought it would simply stop the event when running in the
> context of the idle task (swapper, pid 0) on any CPU. That would seem to
> match the succinct description.
>
> However, running a simple:
>
> $ perf record -a -e cycles:I sleep 5
> perf_event_attr:
> sample_type IP|TID|TIME|CPU|PERIOD
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID
> exclude_idle 1
>
> on an idle machine, showed me that this is not what is actually happening:
> $ perf script
> swapper 0 [000] 1132634.287442: 1 cycles:I:
> ffffffff8100b1fb __intel_pmu_enable_all.isra.17 ([kernel.kallsyms])
> swapper 0 [001] 1132634.287498: 1 cycles:I:
> ffffffff8100b1fb __intel_pmu_enable_all.isra.17 ([kernel.kallsyms])
>
>
> After looking at the code, it all made sense, it seems to current
> implementation is only relevant for sw events. I don't understand why.
>
> I am left wondering what is the goal of exclude_idle?

A "git grep exclude_idle" seems to suggest powerpc/arm have it
immplemented for their PMU. If we then look at commit:

2743a5b0fa6f ("perfcounters: provide expansion room in the ABI")

It was Paul who introduced the bit.

So I'm thinking that if x86 doesn't implement it, we should at least
error out on it. Of course, so far we've allowed it, so who knows what
all will break if we start asserting that :/



2018-04-20 08:37:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] perf/core: what is exclude_idle supposed to do

On Wed, Apr 18, 2018 at 11:10:20AM -0400, Vince Weaver wrote:
> On Tue, 17 Apr 2018, Jiri Olsa wrote:
>
> > On Mon, Apr 16, 2018 at 10:04:53PM +0000, Stephane Eranian wrote:
> > > Hi,
> > >
> > > I am trying to understand what the exclude_idle event attribute is supposed
> > > to accomplish.
> > > As per the definition in the header file:
> > >
> > > exclude_idle : 1, /* don't count when idle */
> >
> > AFAICS it's not implemented
>
> so just to be completely clear hear, we're saying that the "exclude_idle"
> modifier has never done anything useful and still doesn't?

AFAICT it works on Power and possibly ARM.

2018-04-20 14:19:46

by Vince Weaver

[permalink] [raw]
Subject: Re: [RFC] perf/core: what is exclude_idle supposed to do

On Fri, 20 Apr 2018, Peter Zijlstra wrote:

> On Wed, Apr 18, 2018 at 11:10:20AM -0400, Vince Weaver wrote:
> > On Tue, 17 Apr 2018, Jiri Olsa wrote:
> >
> > > On Mon, Apr 16, 2018 at 10:04:53PM +0000, Stephane Eranian wrote:
> > > > Hi,
> > > >
> > > > I am trying to understand what the exclude_idle event attribute is supposed
> > > > to accomplish.
> > > > As per the definition in the header file:
> > > >
> > > > exclude_idle : 1, /* don't count when idle */
> > >
> > > AFAICS it's not implemented
> >
> > so just to be completely clear hear, we're saying that the "exclude_idle"
> > modifier has never done anything useful and still doesn't?
>
> AFAICT it works on Power and possibly ARM.

at least some ARMs are a bit more honest about it than x86

ivybridge:
Performance counter stats for '/bin/ls':
1,368,162 instructions
1,368,162 instructions:I

pi2/ARM cortex-A7
Performance counter stats for '/bin/ls':
1,910,083 instructions
<not supported> instructions:I

I'd fire up my Power8 machine to see but not sure it's worth the hassle
and/or having to get out the ear protection.

Vince

2018-04-20 16:54:41

by Vince Weaver

[permalink] [raw]
Subject: Re: [RFC] perf/core: what is exclude_idle supposed to do

On Fri, 20 Apr 2018, Vince Weaver wrote:

> > AFAICT it works on Power and possibly ARM.
>
> at least some ARMs are a bit more honest about it than x86
>
> ivybridge:
> Performance counter stats for '/bin/ls':
> 1,368,162 instructions
> 1,368,162 instructions:I
>
> pi2/ARM cortex-A7
> Performance counter stats for '/bin/ls':
> 1,910,083 instructions
> <not supported> instructions:I
>
> I'd fire up my Power8 machine to see but not sure it's worth the hassle
> and/or having to get out the ear protection.

I did power up the Power8 machine in the end:

power8:
perf stat -e cycles,cycles:I sleep 5
Performance counter stats for 'sleep 5':
14,271,273 cycles
14,271,273 cycles:I

???

But then if I try again on power8

perf stat -a -e cycles,cycles:I sleep 5
Performance counter stats for 'system wide':
1,238,772,322,327 cycles
1,238,674,771,713 cycles:I

there is a difference.

But then on ivybridge

perf stat -a -e cycles,cycles:I sleep 5
Performance counter stats for 'system wide':
589,598,104 cycles
589,537,190 cycles:I

there is also a different in system wide mode.

So maybe exclude_idle does do something on x86? Or am I completely
misunderstanding what the flag is supposed to be indicating?

Vince

2018-04-20 18:22:23

by Stephane Eranian

[permalink] [raw]
Subject: Re: [RFC] perf/core: what is exclude_idle supposed to do

On Fri, Apr 20, 2018 at 9:51 AM Vince Weaver <[email protected]>
wrote:

> On Fri, 20 Apr 2018, Vince Weaver wrote:

> > > AFAICT it works on Power and possibly ARM.
> >
> > at least some ARMs are a bit more honest about it than x86
> >
> > ivybridge:
> > Performance counter stats for '/bin/ls':
> > 1,368,162 instructions
> > 1,368,162 instructions:I
> >
> > pi2/ARM cortex-A7
> > Performance counter stats for '/bin/ls':
> > 1,910,083 instructions
> > <not supported> instructions:I
> >
> > I'd fire up my Power8 machine to see but not sure it's worth the hassle
> > and/or having to get out the ear protection.

> I did power up the Power8 machine in the end:

> power8:
> perf stat -e cycles,cycles:I sleep 5
> Performance counter stats for 'sleep 5':
> 14,271,273 cycles
> 14,271,273 cycles:I

> ???

> But then if I try again on power8

> perf stat -a -e cycles,cycles:I sleep 5
> Performance counter stats for 'system wide':
> 1,238,772,322,327 cycles
> 1,238,674,771,713 cycles:I

> there is a difference.

> But then on ivybridge

> perf stat -a -e cycles,cycles:I sleep 5
> Performance counter stats for 'system wide':
> 589,598,104 cycles
> 589,537,190 cycles:I

This may be noise.

The way the flag is named leads me to believe its goal is to not count when
executing in the context of the idle task (pid 0 on each CPU).
However, it does not seem to be implemented that way. If you were to
implement this, then in system wide mode you'd have to check
the incoming task on ctxsw, very much like we do in cgroup monitoring. So
it would not be totally free. One can argue, in sampling mode
you can eliminate the samples coming from PID=0 in the tool. But there
would be nothing to cover counting mode.

Interestingly, there is also code in perf tool to exclude known idle
routines from reporting. But this is targeted to only some routines that the
idle task may end up executing. so it is not quite the same.


> So maybe exclude_idle does do something on x86? Or am I completely
> misunderstanding what the flag is supposed to be indicating?

> Vince