2013-07-23 00:31:14

by Frederic Weisbecker

Subject: [PATCH 0/8] perf: Finer grained full dynticks handling

Hi,

This patchset takes inspiration from Peterz's patch in http://marc.info/?l=linux-kernel&m=137155182121860
to make perf retain the timer tick less often, i.e. only when there are freq events
or when an event throttles.

The main purpose is to make the lockup detector work with full dynticks, as I suspect
that distros want to enable both...
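For readers less familiar with the distinction, here is a minimal userspace sketch (an illustration only, not part of the series; it assumes a Linux box with perf available and a permissive perf_event_paranoid setting) of the two kinds of sampling events the series distinguishes: a frequency event, which needs the tick to keep re-adjusting its period, and a fixed-period event, which after this series no longer forces the tick to stay alive (as long as it doesn't throttle):

/* freq-vs-period.c: open one frequency-based and one fixed-period
 * cycles event on the current task, then close them. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

static int perf_open(struct perf_event_attr *attr)
{
	/* perf_event_open has no glibc wrapper */
	return syscall(__NR_perf_event_open, attr, 0 /* this task */,
		       -1 /* any cpu */, -1 /* no group */, 0);
}

int main(void)
{
	struct perf_event_attr freq_ev, period_ev;
	int fd1, fd2;

	memset(&freq_ev, 0, sizeof(freq_ev));
	freq_ev.size = sizeof(freq_ev);
	freq_ev.type = PERF_TYPE_HARDWARE;
	freq_ev.config = PERF_COUNT_HW_CPU_CYCLES;
	freq_ev.freq = 1;		/* sample_freq means samples per second */
	freq_ev.sample_freq = 1000;	/* needs the tick to retune the period */

	memset(&period_ev, 0, sizeof(period_ev));
	period_ev.size = sizeof(period_ev);
	period_ev.type = PERF_TYPE_HARDWARE;
	period_ev.config = PERF_COUNT_HW_CPU_CYCLES;
	period_ev.sample_period = 10000000; /* fixed period: no tick needed */

	fd1 = perf_open(&freq_ev);
	fd2 = perf_open(&period_ev);
	printf("freq event fd=%d, fixed-period event fd=%d\n", fd1, fd2);
	if (fd1 >= 0)
		close(fd1);
	if (fd2 >= 0)
		close(fd2);
	return 0;
}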


You can fetch from:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
perf/nohz

Thanks,
Frederic
---

Frederic Weisbecker (8):
perf: Fix branch stack refcount leak on callchain init failure
perf: Sanitize get_callchain_buffer()
perf: Gather event accounting code
perf: Split per cpu event accounting code
perf: Migrate per cpu event accounting
perf: Account freq events per cpu
perf: Finer grained full dynticks kick
watchdog: Remove hack to make full dynticks working


kernel/events/callchain.c | 2 +
kernel/events/core.c | 203 ++++++++++++++++++++++++++++-----------------
kernel/watchdog.c | 8 --
3 files changed, 128 insertions(+), 85 deletions(-)


2013-07-23 00:31:16

by Frederic Weisbecker

Subject: [PATCH 1/8] perf: Fix branch stack refcount leak on callchain init failure

On callchain buffers allocation failure, free_event() is
called and all the accounting performed in perf_event_alloc()
for that event is cancelled.

But if the event uses branch stack sampling, it is also
unaccounted from the branch stack sampling events refcounts.

This is a bug because that accounting is only performed after the
callchain buffer allocation, so it hasn't happened yet when the
allocation fails. As a result, the branch stack sampling events
refcount can become negative.

To fix this, move the branch stack event accounting before the
callchain buffer allocation.
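
To see why the ordering matters, here is a standalone sketch (hypothetical names, not the kernel code) of the pattern described above: the accounting is performed after a step that can fail, while the error path unaccounts unconditionally, so one failed allocation leaves the refcount negative:

/* ordering-bug.c: model of accounting done after a fallible step */
#include <stdio.h>

static int branch_stack_refcount;

static int alloc_callchain_buffers(void)
{
	return -1;			/* simulate allocation failure */
}

static void free_event(void)
{
	/* undoes accounting that, on this path, never happened */
	branch_stack_refcount--;
}

static int event_alloc(void)
{
	if (alloc_callchain_buffers()) {
		free_event();
		return -1;
	}
	branch_stack_refcount++;	/* accounting performed too late */
	return 0;
}

int main(void)
{
	event_alloc();
	printf("refcount = %d\n", branch_stack_refcount);	/* prints -1 */
	return 0;
}

Doing the increment before the fallible allocation, as the patch does, keeps the two paths symmetric.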

Reported-by: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Stephane Eranian <[email protected]>
---
kernel/events/core.c | 12 ++++++------
1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5e2bce9..7ffb81e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6566,6 +6566,12 @@ done:
atomic_inc(&nr_comm_events);
if (event->attr.task)
atomic_inc(&nr_task_events);
+ if (has_branch_stack(event)) {
+ static_key_slow_inc(&perf_sched_events.key);
+ if (!(event->attach_state & PERF_ATTACH_TASK))
+ atomic_inc(&per_cpu(perf_branch_stack_events,
+ event->cpu));
+ }
if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
err = get_callchain_buffers();
if (err) {
@@ -6573,12 +6579,6 @@ done:
return ERR_PTR(err);
}
}
- if (has_branch_stack(event)) {
- static_key_slow_inc(&perf_sched_events.key);
- if (!(event->attach_state & PERF_ATTACH_TASK))
- atomic_inc(&per_cpu(perf_branch_stack_events,
- event->cpu));
- }
}

return event;
--
1.7.5.4

2013-07-23 00:31:24

by Frederic Weisbecker

Subject: [PATCH 6/8] perf: Account freq events per cpu

This is going to be used by the full dynticks subsystem
as finer-grained information to know when to keep the tick
and when to stop it.
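
As a rough userspace model (assumed names and a fixed NR_CPUS; not the kernel code), the idea is one counter per CPU that is bumped whenever a frequency-based event is accounted there; a later patch then lets the tick stop only where that counter (and the throttle count) is zero:

/* percpu-freq.c: userspace model of per-CPU freq-event accounting */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

/* models DEFINE_PER_CPU(atomic_t, perf_freq_events) */
static atomic_int perf_freq_events[NR_CPUS];
static atomic_int perf_throttled_count[NR_CPUS];

static void account_event_cpu(bool freq, int cpu)
{
	if (freq)
		atomic_fetch_add(&perf_freq_events[cpu], 1);
}

static void unaccount_event_cpu(bool freq, int cpu)
{
	if (freq)
		atomic_fetch_sub(&perf_freq_events[cpu], 1);
}

static bool perf_event_can_stop_tick(int cpu)
{
	return atomic_load(&perf_freq_events[cpu]) == 0 &&
	       atomic_load(&perf_throttled_count[cpu]) == 0;
}

int main(void)
{
	account_event_cpu(true, 1);	/* a freq event lands on CPU 1 */
	printf("cpu0 can stop tick: %d\n", perf_event_can_stop_tick(0)); /* 1 */
	printf("cpu1 can stop tick: %d\n", perf_event_can_stop_tick(1)); /* 0 */
	unaccount_event_cpu(true, 1);
	printf("cpu1 can stop tick: %d\n", perf_event_can_stop_tick(1)); /* 1 */
	return 0;
}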

Original-patch-by: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Stephane Eranian <[email protected]>
---
kernel/events/core.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index b40c3db..f9bd39b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -141,6 +141,7 @@ enum event_type_t {
struct static_key_deferred perf_sched_events __read_mostly;
static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
+static DEFINE_PER_CPU(atomic_t, perf_freq_events);

static atomic_t nr_mmap_events __read_mostly;
static atomic_t nr_comm_events __read_mostly;
@@ -3139,6 +3140,9 @@ static void unaccount_event_cpu(struct perf_event *event, int cpu)
}
if (is_cgroup_event(event))
atomic_dec(&per_cpu(perf_cgroup_events, cpu));
+
+ if (event->attr.freq)
+ atomic_dec(&per_cpu(perf_freq_events, cpu));
}

static void unaccount_event(struct perf_event *event)
@@ -6473,6 +6477,9 @@ static void account_event_cpu(struct perf_event *event, int cpu)
}
if (is_cgroup_event(event))
atomic_inc(&per_cpu(perf_cgroup_events, cpu));
+
+ if (event->attr.freq)
+ atomic_inc(&per_cpu(perf_freq_events, cpu));
}

static void account_event(struct perf_event *event)
--
1.7.5.4

2013-07-23 00:31:19

by Frederic Weisbecker

Subject: [PATCH 3/8] perf: Gather event accounting code

Gather all the event accounting code to a single place,
once all the prerequisites are completed. This simplifies
the refcounting.

Original-patch-by: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Stephane Eranian <[email protected]>
---
kernel/events/core.c | 79 +++++++++++++++++++++++++++++--------------------
1 files changed, 47 insertions(+), 32 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index c5f435f..3bb73af 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3128,6 +3128,21 @@ static void free_event_rcu(struct rcu_head *head)
static void ring_buffer_put(struct ring_buffer *rb);
static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb);

+static void __free_event(struct perf_event *event)
+{
+ if (!event->parent) {
+ if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
+ put_callchain_buffers();
+ }
+
+ if (event->destroy)
+ event->destroy(event);
+
+ if (event->ctx)
+ put_ctx(event->ctx);
+
+ call_rcu(&event->rcu_head, free_event_rcu);
+}
static void free_event(struct perf_event *event)
{
irq_work_sync(&event->pending);
@@ -3141,8 +3156,6 @@ static void free_event(struct perf_event *event)
atomic_dec(&nr_comm_events);
if (event->attr.task)
atomic_dec(&nr_task_events);
- if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
- put_callchain_buffers();
if (is_cgroup_event(event)) {
atomic_dec(&per_cpu(perf_cgroup_events, event->cpu));
static_key_slow_dec_deferred(&perf_sched_events);
@@ -3180,13 +3193,8 @@ static void free_event(struct perf_event *event)
if (is_cgroup_event(event))
perf_detach_cgroup(event);

- if (event->destroy)
- event->destroy(event);
-
- if (event->ctx)
- put_ctx(event->ctx);

- call_rcu(&event->rcu_head, free_event_rcu);
+ __free_event(event);
}

int perf_event_release_kernel(struct perf_event *event)
@@ -6442,6 +6450,29 @@ unlock:
return pmu;
}

+static void account_event(struct perf_event *event)
+{
+ if (event->attach_state & PERF_ATTACH_TASK)
+ static_key_slow_inc(&perf_sched_events.key);
+ if (event->attr.mmap || event->attr.mmap_data)
+ atomic_inc(&nr_mmap_events);
+ if (event->attr.comm)
+ atomic_inc(&nr_comm_events);
+ if (event->attr.task)
+ atomic_inc(&nr_task_events);
+ if (has_branch_stack(event)) {
+ static_key_slow_inc(&perf_sched_events.key);
+ if (!(event->attach_state & PERF_ATTACH_TASK))
+ atomic_inc(&per_cpu(perf_branch_stack_events,
+ event->cpu));
+ }
+
+ if (is_cgroup_event(event)) {
+ atomic_inc(&per_cpu(perf_cgroup_events, event->cpu));
+ static_key_slow_inc(&perf_sched_events.key);
+ }
+}
+
/*
* Allocate and initialize a event structure
*/
@@ -6555,21 +6586,6 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
if (err)
goto err_pmu;
}
-
- if (event->attach_state & PERF_ATTACH_TASK)
- static_key_slow_inc(&perf_sched_events.key);
- if (event->attr.mmap || event->attr.mmap_data)
- atomic_inc(&nr_mmap_events);
- if (event->attr.comm)
- atomic_inc(&nr_comm_events);
- if (event->attr.task)
- atomic_inc(&nr_task_events);
- if (has_branch_stack(event)) {
- static_key_slow_inc(&perf_sched_events.key);
- if (!(event->attach_state & PERF_ATTACH_TASK))
- atomic_inc(&per_cpu(perf_branch_stack_events,
- event->cpu));
- }
}

return event;
@@ -6864,17 +6880,14 @@ SYSCALL_DEFINE5(perf_event_open,

if (flags & PERF_FLAG_PID_CGROUP) {
err = perf_cgroup_connect(pid, event, &attr, group_leader);
- if (err)
- goto err_alloc;
- /*
- * one more event:
- * - that has cgroup constraint on event->cpu
- * - that may need work on context switch
- */
- atomic_inc(&per_cpu(perf_cgroup_events, event->cpu));
- static_key_slow_inc(&perf_sched_events.key);
+ if (err) {
+ __free_event(event);
+ goto err_task;
+ }
}

+ account_event(event);
+
/*
* Special case software events and allow them to be part of
* any hardware group.
@@ -7070,6 +7083,8 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
goto err;
}

+ account_event(event);
+
ctx = find_get_context(event->pmu, task, cpu);
if (IS_ERR(ctx)) {
err = PTR_ERR(ctx);
--
1.7.5.4

2013-07-23 00:31:38

by Frederic Weisbecker

Subject: [PATCH 8/8] watchdog: Remove hack to make full dynticks working

A perf event can be used without forcing the tick to
stay alive if it uses a sample period rather than a frequency
and if it doesn't throttle (raise a storm of events).

Since the lockup detector neither uses a perf event frequency
nor should ever throttle, thanks to its high period, it can now
run concurrently with the full dynticks feature.
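
For illustration, this is roughly the shape of event the changelog relies on (the cycles event type matches the kernel's NMI watchdog; the 2 GHz clock and 10 second threshold are assumptions for the arithmetic): a hardware event with freq left at 0 and a sample period large enough that it fires only about once per watchdog window:

/* watchdog-shape.c: build (but do not open) a watchdog-style attribute */
#include <linux/perf_event.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
	struct perf_event_attr wd;
	unsigned long long cpu_hz = 2000000000ULL;	/* assumed 2 GHz CPU */
	unsigned int watchdog_thresh = 10;		/* assumed 10 s threshold */

	memset(&wd, 0, sizeof(wd));
	wd.size = sizeof(wd);
	wd.type = PERF_TYPE_HARDWARE;
	wd.config = PERF_COUNT_HW_CPU_CYCLES;
	wd.freq = 0;					/* a period, not a frequency */
	wd.sample_period = cpu_hz * watchdog_thresh;	/* ~one NMI every 10 s */

	printf("watchdog-style sample_period: %llu cycles\n",
	       (unsigned long long)wd.sample_period);
	return 0;
}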

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Don Zickus <[email protected]>
Cc: Srivatsa S. Bhat <[email protected]>
Cc: Anish Singh <[email protected]>
---
kernel/watchdog.c | 8 --------
1 files changed, 0 insertions(+), 8 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 1241d8c..51c4f34 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -553,14 +553,6 @@ void __init lockup_detector_init(void)
{
set_sample_period();

-#ifdef CONFIG_NO_HZ_FULL
- if (watchdog_user_enabled) {
- watchdog_user_enabled = 0;
- pr_warning("Disabled lockup detectors by default for full dynticks\n");
- pr_warning("You can reactivate it with 'sysctl -w kernel.watchdog=1'\n");
- }
-#endif
-
if (watchdog_user_enabled)
watchdog_enable_all_cpus();
}
--
1.7.5.4

2013-07-23 00:31:23

by Frederic Weisbecker

Subject: [PATCH 5/8] perf: Migrate per cpu event accounting

When an event is migrated, move the event per-cpu
accounting accordingly so that branch stack and cgroup
events work correctly on the new CPU.

Original-patch-by: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Stephane Eranian <[email protected]>
---
kernel/events/core.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index fd820a4..b40c3db 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7144,6 +7144,7 @@ void perf_pmu_migrate_context(struct pmu *pmu, int src_cpu, int dst_cpu)
list_for_each_entry_safe(event, tmp, &src_ctx->event_list,
event_entry) {
perf_remove_from_context(event);
+ unaccount_event_cpu(event, src_cpu);
put_ctx(src_ctx);
list_add(&event->event_entry, &events);
}
@@ -7156,6 +7157,7 @@ void perf_pmu_migrate_context(struct pmu *pmu, int src_cpu, int dst_cpu)
list_del(&event->event_entry);
if (event->state >= PERF_EVENT_STATE_OFF)
event->state = PERF_EVENT_STATE_INACTIVE;
+ account_event_cpu(event, dst_cpu);
perf_install_in_context(dst_ctx, event, dst_cpu);
get_ctx(dst_ctx);
}
--
1.7.5.4

2013-07-23 00:31:53

by Frederic Weisbecker

Subject: [PATCH 7/8] perf: Finer grained full dynticks kick

Currently the full dynticks subsystem keeps the
tick alive as long as there are perf events running.

This prevents the tick from being stopped as long as features
such as the lockup detector are running. As a temporary fix,
the lockup detector is disabled by default when full dynticks
is built, but this is not a viable long term solution.

To fix this, only keep the tick alive when an event configured
with a frequency rather than a period is running on the CPU,
or when an event throttles on the CPU.

These are the only purposes of the perf tick, especially now that
the rotation of flexible events is handled from a separate hrtimer.
The tick can be shut down the rest of the time.

Original-patch-by: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Stephane Eranian <[email protected]>
---
kernel/events/core.c | 17 +++++++++--------
1 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f9bd39b..61d70b7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -870,12 +870,8 @@ static void perf_pmu_rotate_start(struct pmu *pmu)

WARN_ON(!irqs_disabled());

- if (list_empty(&cpuctx->rotation_list)) {
- int was_empty = list_empty(head);
+ if (list_empty(&cpuctx->rotation_list))
list_add(&cpuctx->rotation_list, head);
- if (was_empty)
- tick_nohz_full_kick();
- }
}

static void get_ctx(struct perf_event_context *ctx)
@@ -1875,6 +1871,9 @@ static int __perf_install_in_context(void *info)
perf_pmu_enable(cpuctx->ctx.pmu);
perf_ctx_unlock(cpuctx, task_ctx);

+ if (atomic_read(&__get_cpu_var(perf_freq_events)))
+ tick_nohz_full_kick();
+
return 0;
}

@@ -2812,10 +2811,11 @@ done:
#ifdef CONFIG_NO_HZ_FULL
bool perf_event_can_stop_tick(void)
{
- if (list_empty(&__get_cpu_var(rotation_list)))
- return true;
- else
+ if (atomic_read(&__get_cpu_var(perf_freq_events)) ||
+ __this_cpu_read(perf_throttled_count))
return false;
+ else
+ return true;
}
#endif

@@ -5201,6 +5201,7 @@ static int __perf_event_overflow(struct perf_event *event,
__this_cpu_inc(perf_throttled_count);
hwc->interrupts = MAX_INTERRUPTS;
perf_log_throttle(event, 0);
+ tick_nohz_full_kick();
ret = 1;
}
}
--
1.7.5.4

2013-07-23 00:32:28

by Frederic Weisbecker

Subject: [PATCH 4/8] perf: Split per cpu event accounting code

Split the per-cpu accounting part of the event
accounting code.

This way we can use the per-cpu handling separately.
This is going to be used to fix the event migration
code accounting.

Original-patch-by: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Stephane Eranian <[email protected]>
---
kernel/events/core.c | 87 +++++++++++++++++++++++++++++++------------------
1 files changed, 55 insertions(+), 32 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3bb73af..fd820a4 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3128,6 +3128,40 @@ static void free_event_rcu(struct rcu_head *head)
static void ring_buffer_put(struct ring_buffer *rb);
static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb);

+static void unaccount_event_cpu(struct perf_event *event, int cpu)
+{
+ if (event->parent)
+ return;
+
+ if (has_branch_stack(event)) {
+ if (!(event->attach_state & PERF_ATTACH_TASK))
+ atomic_dec(&per_cpu(perf_branch_stack_events, cpu));
+ }
+ if (is_cgroup_event(event))
+ atomic_dec(&per_cpu(perf_cgroup_events, cpu));
+}
+
+static void unaccount_event(struct perf_event *event)
+{
+ if (event->parent)
+ return;
+
+ if (event->attach_state & PERF_ATTACH_TASK)
+ static_key_slow_dec_deferred(&perf_sched_events);
+ if (event->attr.mmap || event->attr.mmap_data)
+ atomic_dec(&nr_mmap_events);
+ if (event->attr.comm)
+ atomic_dec(&nr_comm_events);
+ if (event->attr.task)
+ atomic_dec(&nr_task_events);
+ if (is_cgroup_event(event))
+ static_key_slow_dec_deferred(&perf_sched_events);
+ if (has_branch_stack(event))
+ static_key_slow_dec_deferred(&perf_sched_events);
+
+ unaccount_event_cpu(event, event->cpu);
+}
+
static void __free_event(struct perf_event *event)
{
if (!event->parent) {
@@ -3147,29 +3181,7 @@ static void free_event(struct perf_event *event)
{
irq_work_sync(&event->pending);

- if (!event->parent) {
- if (event->attach_state & PERF_ATTACH_TASK)
- static_key_slow_dec_deferred(&perf_sched_events);
- if (event->attr.mmap || event->attr.mmap_data)
- atomic_dec(&nr_mmap_events);
- if (event->attr.comm)
- atomic_dec(&nr_comm_events);
- if (event->attr.task)
- atomic_dec(&nr_task_events);
- if (is_cgroup_event(event)) {
- atomic_dec(&per_cpu(perf_cgroup_events, event->cpu));
- static_key_slow_dec_deferred(&perf_sched_events);
- }
-
- if (has_branch_stack(event)) {
- static_key_slow_dec_deferred(&perf_sched_events);
- /* is system-wide event */
- if (!(event->attach_state & PERF_ATTACH_TASK)) {
- atomic_dec(&per_cpu(perf_branch_stack_events,
- event->cpu));
- }
- }
- }
+ unaccount_event(event);

if (event->rb) {
struct ring_buffer *rb;
@@ -6450,8 +6462,24 @@ unlock:
return pmu;
}

+static void account_event_cpu(struct perf_event *event, int cpu)
+{
+ if (event->parent)
+ return;
+
+ if (has_branch_stack(event)) {
+ if (!(event->attach_state & PERF_ATTACH_TASK))
+ atomic_inc(&per_cpu(perf_branch_stack_events, cpu));
+ }
+ if (is_cgroup_event(event))
+ atomic_inc(&per_cpu(perf_cgroup_events, cpu));
+}
+
static void account_event(struct perf_event *event)
{
+ if (event->parent)
+ return;
+
if (event->attach_state & PERF_ATTACH_TASK)
static_key_slow_inc(&perf_sched_events.key);
if (event->attr.mmap || event->attr.mmap_data)
@@ -6460,17 +6488,12 @@ static void account_event(struct perf_event *event)
atomic_inc(&nr_comm_events);
if (event->attr.task)
atomic_inc(&nr_task_events);
- if (has_branch_stack(event)) {
+ if (has_branch_stack(event))
static_key_slow_inc(&perf_sched_events.key);
- if (!(event->attach_state & PERF_ATTACH_TASK))
- atomic_inc(&per_cpu(perf_branch_stack_events,
- event->cpu));
- }
-
- if (is_cgroup_event(event)) {
- atomic_inc(&per_cpu(perf_cgroup_events, event->cpu));
+ if (is_cgroup_event(event))
static_key_slow_inc(&perf_sched_events.key);
- }
+
+ account_event_cpu(event, event->cpu);
}

/*
--
1.7.5.4

2013-07-23 00:32:44

by Frederic Weisbecker

Subject: [PATCH 2/8] perf: Sanitize get_callchain_buffer()

In case of allocation failure, get_callchain_buffer() keeps the
refcount incremented for the current event.

As a result, when get_callchain_buffers() returns an error,
we must clean up after it by cancelling its last refcount
with a call to put_callchain_buffers().

This is a hack in order to be able to call free_event()
after that failure.

The original purpose of that was to simplify the failure
path. But this error handling is actually counter-intuitive,
ugly and not very easy to follow, because one expects the
resources used to perform a service to be cleaned up by the
callee in case of failure, not by the caller.

So let's clean this up by cancelling the refcount from
get_callchain_buffer() itself in case of failure, and free
the event accordingly in perf_event_alloc().
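
Here is a standalone sketch (hypothetical names, not the kernel code) of the convention argued for here: the helper that fails cancels its own refcount, so the caller's error path only unwinds the steps it completed itself, via labels in reverse order:

/* callee-cleanup.c: model of "the callee cleans up after its own failure" */
#include <stdio.h>

static int callchain_refcount;

static int get_callchain_buffers(void)
{
	int failed = 1;			/* simulate allocation failure */

	callchain_refcount++;
	if (failed) {
		callchain_refcount--;	/* callee cancels its own refcount */
		return -12;		/* -ENOMEM */
	}
	return 0;
}

static int event_alloc(void)
{
	int err;

	err = get_callchain_buffers();
	if (err)
		goto err_pmu;		/* nothing of ours left to undo */

	return 0;

err_pmu:
	/* undo pmu init, drop namespace refs, free the event, ... */
	return err;
}

int main(void)
{
	printf("event_alloc() = %d, refcount = %d\n",
	       event_alloc(), callchain_refcount);	/* refcount stays 0 */
	return 0;
}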

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Stephane Eranian <[email protected]>
---
kernel/events/callchain.c | 2 ++
kernel/events/core.c | 41 +++++++++++++++++++++--------------------
2 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index c772061..76a8bc5 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -117,6 +117,8 @@ int get_callchain_buffers(void)
err = alloc_callchain_buffers();
exit:
mutex_unlock(&callchain_mutex);
+ if (err)
+ atomic_dec(&nr_callchain_events);

return err;
}
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7ffb81e..c5f435f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6456,7 +6456,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
struct pmu *pmu;
struct perf_event *event;
struct hw_perf_event *hwc;
- long err;
+ long err = -EINVAL;

if ((unsigned)cpu >= nr_cpu_ids) {
if (!task || cpu != -1)
@@ -6539,25 +6539,23 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
* we currently do not support PERF_FORMAT_GROUP on inherited events
*/
if (attr->inherit && (attr->read_format & PERF_FORMAT_GROUP))
- goto done;
+ goto err_ns;

pmu = perf_init_event(event);
-
-done:
- err = 0;
if (!pmu)
- err = -EINVAL;
- else if (IS_ERR(pmu))
+ goto err_ns;
+ else if (IS_ERR(pmu)) {
err = PTR_ERR(pmu);
-
- if (err) {
- if (event->ns)
- put_pid_ns(event->ns);
- kfree(event);
- return ERR_PTR(err);
+ goto err_ns;
}

if (!event->parent) {
+ if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
+ err = get_callchain_buffers();
+ if (err)
+ goto err_pmu;
+ }
+
if (event->attach_state & PERF_ATTACH_TASK)
static_key_slow_inc(&perf_sched_events.key);
if (event->attr.mmap || event->attr.mmap_data)
@@ -6572,16 +6570,19 @@ done:
atomic_inc(&per_cpu(perf_branch_stack_events,
event->cpu));
}
- if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
- err = get_callchain_buffers();
- if (err) {
- free_event(event);
- return ERR_PTR(err);
- }
- }
}

return event;
+
+err_pmu:
+ if (event->destroy)
+ event->destroy(event);
+err_ns:
+ if (event->ns)
+ put_pid_ns(event->ns);
+ kfree(event);
+
+ return ERR_PTR(err);
}

static int perf_copy_attr(struct perf_event_attr __user *uattr,
--
1.7.5.4

2013-07-23 12:34:04

by Don Zickus

Subject: Re: [PATCH 8/8] watchdog: Remove hack to make full dynticks working

On Tue, Jul 23, 2013 at 02:31:06AM +0200, Frederic Weisbecker wrote:
> A perf event can be used without forcing the tick to
> stay alive if it doesn't use a frequency but a sample
> period and if it doesn't throttle (raise storm of events).
>
> Since the lockup detector neither use a perf event frequency
> nor should ever throttle due to its high period, it can now
> run concurrently with the full dynticks feature.

Thanks. Dumb question, I keep wondering if the lockup detector would be
better or worse off if it used the perf event frequency as opposed to
using a sample period? The idea is it could follow the varying cpu
frequencies better (and probably simplify some of the code too).

Acked-by: Don Zickus <[email protected]>


>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Jiri Olsa <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Namhyung Kim <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Cc: Stephane Eranian <[email protected]>
> Cc: Don Zickus <[email protected]>
> Cc: Srivatsa S. Bhat <[email protected]>
> Cc: Anish Singh <[email protected]>
> ---
> kernel/watchdog.c | 8 --------
> 1 files changed, 0 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 1241d8c..51c4f34 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -553,14 +553,6 @@ void __init lockup_detector_init(void)
> {
> set_sample_period();
>
> -#ifdef CONFIG_NO_HZ_FULL
> - if (watchdog_user_enabled) {
> - watchdog_user_enabled = 0;
> - pr_warning("Disabled lockup detectors by default for full dynticks\n");
> - pr_warning("You can reactivate it with 'sysctl -w kernel.watchdog=1'\n");
> - }
> -#endif
> -
> if (watchdog_user_enabled)
> watchdog_enable_all_cpus();
> }
> --
> 1.7.5.4
>

2013-07-23 12:44:09

by Frederic Weisbecker

Subject: Re: [PATCH 8/8] watchdog: Remove hack to make full dynticks working

On Tue, Jul 23, 2013 at 08:33:31AM -0400, Don Zickus wrote:
> On Tue, Jul 23, 2013 at 02:31:06AM +0200, Frederic Weisbecker wrote:
> > A perf event can be used without forcing the tick to
> > stay alive if it doesn't use a frequency but a sample
> > period and if it doesn't throttle (raise storm of events).
> >
> > Since the lockup detector neither use a perf event frequency
> > nor should ever throttle due to its high period, it can now
> > run concurrently with the full dynticks feature.
>
> Thanks. Dumb question, I keep wondering if the lockup detector would be
> better or worse off if it used the perf event frequency as opposed to
> using a sample period? The idea is it could follow the varying cpu
> frequencies better (and probably simplify some of the code too).
>
> Acked-by: Don Zickus <[email protected]>

Thanks!

IIRC we tried that and I believe the issue was that perf did not
support frequencies below 1 Hz. So probably the limitation was that it had to
fire at least once per second.

Also if you do that, we fall into the full dynticks problem again.

2013-07-23 12:45:40

by Peter Zijlstra

Subject: Re: [PATCH 8/8] watchdog: Remove hack to make full dynticks working

On Tue, Jul 23, 2013 at 08:33:31AM -0400, Don Zickus wrote:
> On Tue, Jul 23, 2013 at 02:31:06AM +0200, Frederic Weisbecker wrote:
> > A perf event can be used without forcing the tick to
> > stay alive if it doesn't use a frequency but a sample
> > period and if it doesn't throttle (raise storm of events).
> >
> > Since the lockup detector neither use a perf event frequency
> > nor should ever throttle due to its high period, it can now
> > run concurrently with the full dynticks feature.
>
> Thanks. Dumb question, I keep wondering if the lockup detector would be
> better or worse off if it used the perf event frequency as opposed to
> using a sample period? The idea is it could follow the varying cpu
> frequencies better (and probably simplify some of the code too).

Right, trouble is that someone didn't consider fractional frequencies
when writing the interface :/ Lowest we can go is 1 Hz while we'd want
something like 0.1 Hz or smaller.

Also, like the above says, that would interfere with the nohz efforts as
perf needs the tick to re-compute those frequency thingies.
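
To put numbers on that (assuming a 10 second watchdog threshold on a 2 GHz CPU): the detector wants roughly one sample per threshold window, i.e. a rate of about 0.1 Hz, or equivalently a fixed period of about 2 * 10^10 cycles. With the frequency interface limited to whole hertz, the closest it could ask for is 1 Hz, an order of magnitude more interrupts than it needs, and, being a freq event, it would force the tick to stay alive anyway.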

2013-07-25 09:59:51

by Peter Zijlstra

Subject: Re: [PATCH 0/8] perf: Finer grained full dynticks handling

On Tue, Jul 23, 2013 at 02:30:58AM +0200, Frederic Weisbecker wrote:
> Hi,
>
> This patchset inspires from Peterz patch in http://marc.info/?l=linux-kernel&m=137155182121860
> to make perf retain the timer tick less often, ie: only when there are freq events
> or when one throttles.
>
> The main purpose is to make the lockup detector work with full dynticks, as I suspect
> that distros want to enable both...
>
>
> You can fetch from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> perf/nohz
>
> Thanks,
> Frederic

Awesome, looks good, thanks Frederic!

Acked-by: Peter Zijlstra <[email protected]>

2013-07-25 14:02:24

by Frederic Weisbecker

Subject: Re: [PATCH 0/8] perf: Finer grained full dynticks handling

On Thu, Jul 25, 2013 at 11:59:38AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 23, 2013 at 02:30:58AM +0200, Frederic Weisbecker wrote:
> > Hi,
> >
> > This patchset inspires from Peterz patch in http://marc.info/?l=linux-kernel&m=137155182121860
> > to make perf retain the timer tick less often, ie: only when there are freq events
> > or when one throttles.
> >
> > The main purpose is to make the lockup detector work with full dynticks, as I suspect
> > that distros want to enable both...
> >
> >
> > You can fetch from:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > perf/nohz
> >
> > Thanks,
> > Frederic
>
> Awesome, looks good, thanks Frederic!
>
> Acked-by: Peter Zijlstra <[email protected]>

Great! What do you prefer? Do you want to take these or should I do a pull request
to Ingo directly?

Thanks!

2013-07-25 16:29:53

by Peter Zijlstra

Subject: Re: [PATCH 0/8] perf: Finer grained full dynticks handling

On Thu, Jul 25, 2013 at 04:02:17PM +0200, Frederic Weisbecker wrote:
> > Awesome, looks good, thanks Frederic!
> >
> > Acked-by: Peter Zijlstra <[email protected]>
>
> Great! What do you prefer? Do you want to take these or should I do a pull request
> to Ingo directly?

I had intended Ingo to pick these up; but then I remembered he's playing
road warrior atm. So I stuck them in my queue so as not to lose them.

2013-07-25 20:07:45

by Frederic Weisbecker

Subject: Re: [PATCH 0/8] perf: Finer grained full dynticks handling

On Thu, Jul 25, 2013 at 06:29:39PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 25, 2013 at 04:02:17PM +0200, Frederic Weisbecker wrote:
> > > Awesome, looks good, thanks Frederic!
> > >
> > > Acked-by: Peter Zijlstra <[email protected]>
> >
> > Great! What do you prefer? Do you want to take these or should I do a pull request
> > to Ingo directly?
>
> I had intended Ingo to pick these up; but then I remembered he's playing
> road warrior atm. So I stuck them in my queue so as not to loose them.

Thanks!

Subject: [tip:perf/core] perf: Fix branch stack refcount leak on callchain init failure

Commit-ID: 6050cb0b0b366092d1383bc23d7b16cd26db00f0
Gitweb: http://git.kernel.org/tip/6050cb0b0b366092d1383bc23d7b16cd26db00f0
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Tue, 23 Jul 2013 02:30:59 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 30 Jul 2013 22:22:58 +0200

perf: Fix branch stack refcount leak on callchain init failure

On callchain buffers allocation failure, free_event() is
called and all the accounting performed in perf_event_alloc()
for that event is cancelled.

But if the event uses branch stack sampling, it is also
unaccounted from the branch stack sampling events refcounts.

This is a bug because that accounting is only performed after the
callchain buffer allocation, so it hasn't happened yet when the
allocation fails. As a result, the branch stack sampling events
refcount can become negative.

To fix this, move the branch stack event accounting before the
callchain buffer allocation.

Reported-by: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Stephane Eranian <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/events/core.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1274114..f35aa7e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6567,6 +6567,12 @@ done:
atomic_inc(&nr_comm_events);
if (event->attr.task)
atomic_inc(&nr_task_events);
+ if (has_branch_stack(event)) {
+ static_key_slow_inc(&perf_sched_events.key);
+ if (!(event->attach_state & PERF_ATTACH_TASK))
+ atomic_inc(&per_cpu(perf_branch_stack_events,
+ event->cpu));
+ }
if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
err = get_callchain_buffers();
if (err) {
@@ -6574,12 +6580,6 @@ done:
return ERR_PTR(err);
}
}
- if (has_branch_stack(event)) {
- static_key_slow_inc(&perf_sched_events.key);
- if (!(event->attach_state & PERF_ATTACH_TASK))
- atomic_inc(&per_cpu(perf_branch_stack_events,
- event->cpu));
- }
}

return event;

Subject: [tip:perf/core] perf: Sanitize get_callchain_buffer()

Commit-ID: 90983b16078ab0fdc58f0dab3e8e3da79c9579a2
Gitweb: http://git.kernel.org/tip/90983b16078ab0fdc58f0dab3e8e3da79c9579a2
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Tue, 23 Jul 2013 02:31:00 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 30 Jul 2013 22:29:12 +0200

perf: Sanitize get_callchain_buffer()

In case of allocation failure, get_callchain_buffer() keeps the
refcount incremented for the current event.

As a result, when get_callchain_buffers() returns an error,
we must clean up after it by cancelling its last refcount
with a call to put_callchain_buffers().

This is a hack in order to be able to call free_event()
after that failure.

The original purpose of that was to simplify the failure
path. But this error handling is actually counter-intuitive,
ugly and not very easy to follow, because one expects the
resources used to perform a service to be cleaned up by the
callee in case of failure, not by the caller.

So let's clean this up by cancelling the refcount from
get_callchain_buffer() itself in case of failure, and free
the event accordingly in perf_event_alloc().

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Stephane Eranian <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/events/callchain.c | 2 ++
kernel/events/core.c | 41 +++++++++++++++++++++--------------------
2 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index c772061..76a8bc5 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -117,6 +117,8 @@ int get_callchain_buffers(void)
err = alloc_callchain_buffers();
exit:
mutex_unlock(&callchain_mutex);
+ if (err)
+ atomic_dec(&nr_callchain_events);

return err;
}
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f35aa7e..3b99862 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6457,7 +6457,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
struct pmu *pmu;
struct perf_event *event;
struct hw_perf_event *hwc;
- long err;
+ long err = -EINVAL;

if ((unsigned)cpu >= nr_cpu_ids) {
if (!task || cpu != -1)
@@ -6540,25 +6540,23 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
* we currently do not support PERF_FORMAT_GROUP on inherited events
*/
if (attr->inherit && (attr->read_format & PERF_FORMAT_GROUP))
- goto done;
+ goto err_ns;

pmu = perf_init_event(event);
-
-done:
- err = 0;
if (!pmu)
- err = -EINVAL;
- else if (IS_ERR(pmu))
+ goto err_ns;
+ else if (IS_ERR(pmu)) {
err = PTR_ERR(pmu);
-
- if (err) {
- if (event->ns)
- put_pid_ns(event->ns);
- kfree(event);
- return ERR_PTR(err);
+ goto err_ns;
}

if (!event->parent) {
+ if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
+ err = get_callchain_buffers();
+ if (err)
+ goto err_pmu;
+ }
+
if (event->attach_state & PERF_ATTACH_TASK)
static_key_slow_inc(&perf_sched_events.key);
if (event->attr.mmap || event->attr.mmap_data)
@@ -6573,16 +6571,19 @@ done:
atomic_inc(&per_cpu(perf_branch_stack_events,
event->cpu));
}
- if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
- err = get_callchain_buffers();
- if (err) {
- free_event(event);
- return ERR_PTR(err);
- }
- }
}

return event;
+
+err_pmu:
+ if (event->destroy)
+ event->destroy(event);
+err_ns:
+ if (event->ns)
+ put_pid_ns(event->ns);
+ kfree(event);
+
+ return ERR_PTR(err);
}

static int perf_copy_attr(struct perf_event_attr __user *uattr,

Subject: [tip:perf/core] perf: Factor out event accounting code to account_event()/__free_event()

Commit-ID: 766d6c076928191d75ad5b0d0f58f52b1e7682d8
Gitweb: http://git.kernel.org/tip/766d6c076928191d75ad5b0d0f58f52b1e7682d8
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Tue, 23 Jul 2013 02:31:01 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 30 Jul 2013 22:29:12 +0200

perf: Factor out event accounting code to account_event()/__free_event()

Gather all the event accounting code to a single place,
once all the prerequisites are completed. This simplifies
the refcounting.

Original-patch-by: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Stephane Eranian <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/events/core.c | 79 +++++++++++++++++++++++++++++++---------------------
1 file changed, 47 insertions(+), 32 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3b99862..158fd57 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3128,6 +3128,21 @@ static void free_event_rcu(struct rcu_head *head)
static void ring_buffer_put(struct ring_buffer *rb);
static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb);

+static void __free_event(struct perf_event *event)
+{
+ if (!event->parent) {
+ if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
+ put_callchain_buffers();
+ }
+
+ if (event->destroy)
+ event->destroy(event);
+
+ if (event->ctx)
+ put_ctx(event->ctx);
+
+ call_rcu(&event->rcu_head, free_event_rcu);
+}
static void free_event(struct perf_event *event)
{
irq_work_sync(&event->pending);
@@ -3141,8 +3156,6 @@ static void free_event(struct perf_event *event)
atomic_dec(&nr_comm_events);
if (event->attr.task)
atomic_dec(&nr_task_events);
- if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
- put_callchain_buffers();
if (is_cgroup_event(event)) {
atomic_dec(&per_cpu(perf_cgroup_events, event->cpu));
static_key_slow_dec_deferred(&perf_sched_events);
@@ -3180,13 +3193,8 @@ static void free_event(struct perf_event *event)
if (is_cgroup_event(event))
perf_detach_cgroup(event);

- if (event->destroy)
- event->destroy(event);
-
- if (event->ctx)
- put_ctx(event->ctx);

- call_rcu(&event->rcu_head, free_event_rcu);
+ __free_event(event);
}

int perf_event_release_kernel(struct perf_event *event)
@@ -6443,6 +6451,29 @@ unlock:
return pmu;
}

+static void account_event(struct perf_event *event)
+{
+ if (event->attach_state & PERF_ATTACH_TASK)
+ static_key_slow_inc(&perf_sched_events.key);
+ if (event->attr.mmap || event->attr.mmap_data)
+ atomic_inc(&nr_mmap_events);
+ if (event->attr.comm)
+ atomic_inc(&nr_comm_events);
+ if (event->attr.task)
+ atomic_inc(&nr_task_events);
+ if (has_branch_stack(event)) {
+ static_key_slow_inc(&perf_sched_events.key);
+ if (!(event->attach_state & PERF_ATTACH_TASK))
+ atomic_inc(&per_cpu(perf_branch_stack_events,
+ event->cpu));
+ }
+
+ if (is_cgroup_event(event)) {
+ atomic_inc(&per_cpu(perf_cgroup_events, event->cpu));
+ static_key_slow_inc(&perf_sched_events.key);
+ }
+}
+
/*
* Allocate and initialize a event structure
*/
@@ -6556,21 +6587,6 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
if (err)
goto err_pmu;
}
-
- if (event->attach_state & PERF_ATTACH_TASK)
- static_key_slow_inc(&perf_sched_events.key);
- if (event->attr.mmap || event->attr.mmap_data)
- atomic_inc(&nr_mmap_events);
- if (event->attr.comm)
- atomic_inc(&nr_comm_events);
- if (event->attr.task)
- atomic_inc(&nr_task_events);
- if (has_branch_stack(event)) {
- static_key_slow_inc(&perf_sched_events.key);
- if (!(event->attach_state & PERF_ATTACH_TASK))
- atomic_inc(&per_cpu(perf_branch_stack_events,
- event->cpu));
- }
}

return event;
@@ -6865,17 +6881,14 @@ SYSCALL_DEFINE5(perf_event_open,

if (flags & PERF_FLAG_PID_CGROUP) {
err = perf_cgroup_connect(pid, event, &attr, group_leader);
- if (err)
- goto err_alloc;
- /*
- * one more event:
- * - that has cgroup constraint on event->cpu
- * - that may need work on context switch
- */
- atomic_inc(&per_cpu(perf_cgroup_events, event->cpu));
- static_key_slow_inc(&perf_sched_events.key);
+ if (err) {
+ __free_event(event);
+ goto err_task;
+ }
}

+ account_event(event);
+
/*
* Special case software events and allow them to be part of
* any hardware group.
@@ -7071,6 +7084,8 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
goto err;
}

+ account_event(event);
+
ctx = find_get_context(event->pmu, task, cpu);
if (IS_ERR(ctx)) {
err = PTR_ERR(ctx);

Subject: [tip:perf/core] perf: Split the per-cpu accounting part of the event accounting code

Commit-ID: 4beb31f3657348a8b702dd014d01c520e522012f
Gitweb: http://git.kernel.org/tip/4beb31f3657348a8b702dd014d01c520e522012f
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Tue, 23 Jul 2013 02:31:02 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 30 Jul 2013 22:29:13 +0200

perf: Split the per-cpu accounting part of the event accounting code

This way we can use the per-cpu handling separately.
This is going to be used to fix the event migration
code accounting.

Original-patch-by: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Stephane Eranian <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/events/core.c | 87 +++++++++++++++++++++++++++++++++-------------------
1 file changed, 55 insertions(+), 32 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 158fd57..3a4b73a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3128,6 +3128,40 @@ static void free_event_rcu(struct rcu_head *head)
static void ring_buffer_put(struct ring_buffer *rb);
static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb);

+static void unaccount_event_cpu(struct perf_event *event, int cpu)
+{
+ if (event->parent)
+ return;
+
+ if (has_branch_stack(event)) {
+ if (!(event->attach_state & PERF_ATTACH_TASK))
+ atomic_dec(&per_cpu(perf_branch_stack_events, cpu));
+ }
+ if (is_cgroup_event(event))
+ atomic_dec(&per_cpu(perf_cgroup_events, cpu));
+}
+
+static void unaccount_event(struct perf_event *event)
+{
+ if (event->parent)
+ return;
+
+ if (event->attach_state & PERF_ATTACH_TASK)
+ static_key_slow_dec_deferred(&perf_sched_events);
+ if (event->attr.mmap || event->attr.mmap_data)
+ atomic_dec(&nr_mmap_events);
+ if (event->attr.comm)
+ atomic_dec(&nr_comm_events);
+ if (event->attr.task)
+ atomic_dec(&nr_task_events);
+ if (is_cgroup_event(event))
+ static_key_slow_dec_deferred(&perf_sched_events);
+ if (has_branch_stack(event))
+ static_key_slow_dec_deferred(&perf_sched_events);
+
+ unaccount_event_cpu(event, event->cpu);
+}
+
static void __free_event(struct perf_event *event)
{
if (!event->parent) {
@@ -3147,29 +3181,7 @@ static void free_event(struct perf_event *event)
{
irq_work_sync(&event->pending);

- if (!event->parent) {
- if (event->attach_state & PERF_ATTACH_TASK)
- static_key_slow_dec_deferred(&perf_sched_events);
- if (event->attr.mmap || event->attr.mmap_data)
- atomic_dec(&nr_mmap_events);
- if (event->attr.comm)
- atomic_dec(&nr_comm_events);
- if (event->attr.task)
- atomic_dec(&nr_task_events);
- if (is_cgroup_event(event)) {
- atomic_dec(&per_cpu(perf_cgroup_events, event->cpu));
- static_key_slow_dec_deferred(&perf_sched_events);
- }
-
- if (has_branch_stack(event)) {
- static_key_slow_dec_deferred(&perf_sched_events);
- /* is system-wide event */
- if (!(event->attach_state & PERF_ATTACH_TASK)) {
- atomic_dec(&per_cpu(perf_branch_stack_events,
- event->cpu));
- }
- }
- }
+ unaccount_event(event);

if (event->rb) {
struct ring_buffer *rb;
@@ -6451,8 +6463,24 @@ unlock:
return pmu;
}

+static void account_event_cpu(struct perf_event *event, int cpu)
+{
+ if (event->parent)
+ return;
+
+ if (has_branch_stack(event)) {
+ if (!(event->attach_state & PERF_ATTACH_TASK))
+ atomic_inc(&per_cpu(perf_branch_stack_events, cpu));
+ }
+ if (is_cgroup_event(event))
+ atomic_inc(&per_cpu(perf_cgroup_events, cpu));
+}
+
static void account_event(struct perf_event *event)
{
+ if (event->parent)
+ return;
+
if (event->attach_state & PERF_ATTACH_TASK)
static_key_slow_inc(&perf_sched_events.key);
if (event->attr.mmap || event->attr.mmap_data)
@@ -6461,17 +6489,12 @@ static void account_event(struct perf_event *event)
atomic_inc(&nr_comm_events);
if (event->attr.task)
atomic_inc(&nr_task_events);
- if (has_branch_stack(event)) {
+ if (has_branch_stack(event))
static_key_slow_inc(&perf_sched_events.key);
- if (!(event->attach_state & PERF_ATTACH_TASK))
- atomic_inc(&per_cpu(perf_branch_stack_events,
- event->cpu));
- }
-
- if (is_cgroup_event(event)) {
- atomic_inc(&per_cpu(perf_cgroup_events, event->cpu));
+ if (is_cgroup_event(event))
static_key_slow_inc(&perf_sched_events.key);
- }
+
+ account_event_cpu(event, event->cpu);
}

/*

Subject: [tip:perf/core] perf: Migrate per cpu event accounting

Commit-ID: 9a545de019b536771feefb76f85e5038b65c2190
Gitweb: http://git.kernel.org/tip/9a545de019b536771feefb76f85e5038b65c2190
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Tue, 23 Jul 2013 02:31:03 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 30 Jul 2013 22:29:14 +0200

perf: Migrate per cpu event accounting

When an event is migrated, move the event per-cpu
accounting accordingly so that branch stack and cgroup
events work correctly on the new CPU.

Original-patch-by: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Stephane Eranian <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/events/core.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3a4b73a..63bdec9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7145,6 +7145,7 @@ void perf_pmu_migrate_context(struct pmu *pmu, int src_cpu, int dst_cpu)
list_for_each_entry_safe(event, tmp, &src_ctx->event_list,
event_entry) {
perf_remove_from_context(event);
+ unaccount_event_cpu(event, src_cpu);
put_ctx(src_ctx);
list_add(&event->event_entry, &events);
}
@@ -7157,6 +7158,7 @@ void perf_pmu_migrate_context(struct pmu *pmu, int src_cpu, int dst_cpu)
list_del(&event->event_entry);
if (event->state >= PERF_EVENT_STATE_OFF)
event->state = PERF_EVENT_STATE_INACTIVE;
+ account_event_cpu(event, dst_cpu);
perf_install_in_context(dst_ctx, event, dst_cpu);
get_ctx(dst_ctx);
}

Subject: [tip:perf/core] perf: Account freq events per cpu

Commit-ID: ba8a75c16e292c0a3a87406a77508cbbc6cf4ee2
Gitweb: http://git.kernel.org/tip/ba8a75c16e292c0a3a87406a77508cbbc6cf4ee2
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Tue, 23 Jul 2013 02:31:04 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 30 Jul 2013 22:29:14 +0200

perf: Account freq events per cpu

This is going to be used by the full dynticks subsystem
as finer-grained information to know when to keep the tick
and when to stop it.

Original-patch-by: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Stephane Eranian <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/events/core.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 63bdec9..3fe385a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -141,6 +141,7 @@ enum event_type_t {
struct static_key_deferred perf_sched_events __read_mostly;
static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
+static DEFINE_PER_CPU(atomic_t, perf_freq_events);

static atomic_t nr_mmap_events __read_mostly;
static atomic_t nr_comm_events __read_mostly;
@@ -3139,6 +3140,9 @@ static void unaccount_event_cpu(struct perf_event *event, int cpu)
}
if (is_cgroup_event(event))
atomic_dec(&per_cpu(perf_cgroup_events, cpu));
+
+ if (event->attr.freq)
+ atomic_dec(&per_cpu(perf_freq_events, cpu));
}

static void unaccount_event(struct perf_event *event)
@@ -6474,6 +6478,9 @@ static void account_event_cpu(struct perf_event *event, int cpu)
}
if (is_cgroup_event(event))
atomic_inc(&per_cpu(perf_cgroup_events, cpu));
+
+ if (event->attr.freq)
+ atomic_inc(&per_cpu(perf_freq_events, cpu));
}

static void account_event(struct perf_event *event)

Subject: [tip:perf/core] perf: Implement finer grained full dynticks kick

Commit-ID: d84153d6c96f61aa06429586284639f32debf03e
Gitweb: http://git.kernel.org/tip/d84153d6c96f61aa06429586284639f32debf03e
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Tue, 23 Jul 2013 02:31:05 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 30 Jul 2013 22:29:15 +0200

perf: Implement finer grained full dynticks kick

Currently the full dynticks subsystem keeps the
tick alive as long as there are perf events running.

This prevents the tick from being stopped as long as features
such as the lockup detector are running. As a temporary fix,
the lockup detector is disabled by default when full dynticks
is built, but this is not a viable long term solution.

To fix this, only keep the tick alive when an event configured
with a frequency rather than a period is running on the CPU,
or when an event throttles on the CPU.

These are the only purposes of the perf tick, especially now that
the rotation of flexible events is handled from a separate hrtimer.
The tick can be shut down the rest of the time.

Original-patch-by: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Stephane Eranian <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/events/core.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3fe385a..916cf1f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -870,12 +870,8 @@ static void perf_pmu_rotate_start(struct pmu *pmu)

WARN_ON(!irqs_disabled());

- if (list_empty(&cpuctx->rotation_list)) {
- int was_empty = list_empty(head);
+ if (list_empty(&cpuctx->rotation_list))
list_add(&cpuctx->rotation_list, head);
- if (was_empty)
- tick_nohz_full_kick();
- }
}

static void get_ctx(struct perf_event_context *ctx)
@@ -1875,6 +1871,9 @@ static int __perf_install_in_context(void *info)
perf_pmu_enable(cpuctx->ctx.pmu);
perf_ctx_unlock(cpuctx, task_ctx);

+ if (atomic_read(&__get_cpu_var(perf_freq_events)))
+ tick_nohz_full_kick();
+
return 0;
}

@@ -2812,10 +2811,11 @@ done:
#ifdef CONFIG_NO_HZ_FULL
bool perf_event_can_stop_tick(void)
{
- if (list_empty(&__get_cpu_var(rotation_list)))
- return true;
- else
+ if (atomic_read(&__get_cpu_var(perf_freq_events)) ||
+ __this_cpu_read(perf_throttled_count))
return false;
+ else
+ return true;
}
#endif

@@ -5202,6 +5202,7 @@ static int __perf_event_overflow(struct perf_event *event,
__this_cpu_inc(perf_throttled_count);
hwc->interrupts = MAX_INTERRUPTS;
perf_log_throttle(event, 0);
+ tick_nohz_full_kick();
ret = 1;
}
}

Subject: [tip:perf/core] watchdog: Make it work under full dynticks

Commit-ID: 93786a5f6aeb9c032c1c240246c5aabcf457b38f
Gitweb: http://git.kernel.org/tip/93786a5f6aeb9c032c1c240246c5aabcf457b38f
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Tue, 23 Jul 2013 02:31:06 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 30 Jul 2013 22:29:15 +0200

watchdog: Make it work under full dynticks

A perf event can be used without forcing the tick to
stay alive if it uses a sample period rather than a frequency
and if it doesn't throttle (raise a storm of events).

Since the lockup detector neither uses a perf event frequency
nor should ever throttle, thanks to its high period, it can now
run concurrently with the full dynticks feature.

So remove the hack that disabled the watchdog.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Don Zickus <[email protected]>
Cc: Srivatsa S. Bhat <[email protected]>
Cc: Anish Singh <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/watchdog.c | 8 --------
1 file changed, 8 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 1241d8c..51c4f34 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -553,14 +553,6 @@ void __init lockup_detector_init(void)
{
set_sample_period();

-#ifdef CONFIG_NO_HZ_FULL
- if (watchdog_user_enabled) {
- watchdog_user_enabled = 0;
- pr_warning("Disabled lockup detectors by default for full dynticks\n");
- pr_warning("You can reactivate it with 'sysctl -w kernel.watchdog=1'\n");
- }
-#endif
-
if (watchdog_user_enabled)
watchdog_enable_all_cpus();
}

2013-08-01 12:47:30

by Jiri Olsa

Subject: Re: [PATCH 6/8] perf: Account freq events per cpu

On Tue, Jul 23, 2013 at 02:31:04AM +0200, Frederic Weisbecker wrote:
> This is going to be used by the full dynticks subsystem
> as a finer-grained information to know when to keep and
> when to stop the tick.
>
> Original-patch-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Jiri Olsa <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Namhyung Kim <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Cc: Stephane Eranian <[email protected]>
> ---
> kernel/events/core.c | 7 +++++++
> 1 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index b40c3db..f9bd39b 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -141,6 +141,7 @@ enum event_type_t {
> struct static_key_deferred perf_sched_events __read_mostly;
> static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
> static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
> +static DEFINE_PER_CPU(atomic_t, perf_freq_events);
>
> static atomic_t nr_mmap_events __read_mostly;
> static atomic_t nr_comm_events __read_mostly;
> @@ -3139,6 +3140,9 @@ static void unaccount_event_cpu(struct perf_event *event, int cpu)
> }
> if (is_cgroup_event(event))
> atomic_dec(&per_cpu(perf_cgroup_events, cpu));
> +
> + if (event->attr.freq)
> + atomic_dec(&per_cpu(perf_freq_events, cpu));
> }
>
> static void unaccount_event(struct perf_event *event)
> @@ -6473,6 +6477,9 @@ static void account_event_cpu(struct perf_event *event, int cpu)
> }
> if (is_cgroup_event(event))
> atomic_inc(&per_cpu(perf_cgroup_events, cpu));
> +
> + if (event->attr.freq)
> + atomic_inc(&per_cpu(perf_freq_events, cpu));

cpu could be -1 in here.. getting:

[ 178.901881] BUG: unable to handle kernel paging request at 0000000f001cf45a
[ 178.901989] IP: [<ffffffff8110c32b>] account_event_cpu+0xbb/0xd0
[ 178.901989] PGD 751a5067 PUD 0
[ 178.901989] Oops: 0002 [#2] PREEMPT SMP
[ 178.901989] Modules linked in:
[ 178.901989] CPU: 0 PID: 1309 Comm: perf Tainted: G D W 3.11.0-rc3+ #281
[ 178.901989] Hardware name: Intel Corporation Montevina platform/To be filled by O.E.M., BIOS AMVACRB1.86C.0066.B00.0805070703 05/07/2008
[ 178.901989] task: ffff88007519e5c0 ti: ffff880074c08000 task.ti: ffff880074c08000
[ 178.901989] RIP: 0010:[<ffffffff8110c32b>] [<ffffffff8110c32b>] account_event_cpu+0xbb/0xd0
[ 178.901989] RSP: 0018:ffff880074c09e38 EFLAGS: 00010216
[ 178.901989] RAX: 0000000f001cf45a RBX: ffff880074c1a800 RCX: 0000000000100000
[ 178.901989] RDX: 000000000000002c RSI: 000000000000002c RDI: ffff880000120400
[ 178.901989] RBP: ffff880074c09e48 R08: 0000000000000000 R09: 00000000ffffffff
[ 178.901989] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffffffffff
[ 178.901989] R13: ffff880074c0e640 R14: 00000000ffffffff R15: 0000000000000000
[ 178.901989] FS: 00007fc670738980(0000) GS:ffff88007a600000(0000) knlGS:0000000000000000
[ 178.901989] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 178.901989] CR2: 0000000f001cf45a CR3: 00000000751a8000 CR4: 00000000000407f0
[ 178.901989] Stack:
[ 178.901989] ffff880074c1a800 ffff880074c09ee8 ffff880074c09e68 ffffffff8110c3b6
[ 178.901989] 00000000ffffffff ffff880074c1a800 ffff880074c09f78 ffffffff81113ae8
[ 178.901989] 0000000000000000 0000000000000001 ffff880074c09eb8 ffffffff00000060
[ 178.901989] Call Trace:
[ 178.901989] [<ffffffff8110c3b6>] account_event.part.66+0x76/0xa0
[ 178.901989] [<ffffffff81113ae8>] SyS_perf_event_open+0x678/0xdf0
[ 178.901989] [<ffffffff8100ed42>] ? syscall_trace_leave+0xb2/0x240
[ 178.901989] [<ffffffff81641512>] tracesys+0xd0/0xd5
[ 178.901989] Code: d4 48 c7 c0 28 f4 1c 00 48 03 04 d5 80 ad ce 81 f0 ff 00 f6 83 c9 00 00 00 04 74 a7 48 c7 c0 4c f4 1c 00 4a 03 04 e5 80 ad ce 81 <f0> ff 00 5b 41 5c 5d c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
[ 178.901989] RIP [<ffffffff8110c32b>] account_event_cpu+0xbb/0xd0
[ 178.901989] RSP <ffff880074c09e38>
[ 178.901989] CR2: 0000000f001cf45a
[ 179.122749] ---[ end trace 6f2f4f69b01368fb ]---


jirka

2013-08-01 12:48:58

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH 6/8] perf: Account freq events per cpu

On Thu, Aug 01, 2013 at 02:46:58PM +0200, Jiri Olsa wrote:
> On Tue, Jul 23, 2013 at 02:31:04AM +0200, Frederic Weisbecker wrote:
> > This is going to be used by the full dynticks subsystem
> > as a finer-grained information to know when to keep and
> > when to stop the tick.
> >
> > Original-patch-by: Peter Zijlstra <[email protected]>
> > Signed-off-by: Frederic Weisbecker <[email protected]>
> > Cc: Jiri Olsa <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Namhyung Kim <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Arnaldo Carvalho de Melo <[email protected]>
> > Cc: Stephane Eranian <[email protected]>
> > ---
> > kernel/events/core.c | 7 +++++++
> > 1 files changed, 7 insertions(+), 0 deletions(-)
> >
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index b40c3db..f9bd39b 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -141,6 +141,7 @@ enum event_type_t {
> > struct static_key_deferred perf_sched_events __read_mostly;
> > static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
> > static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
> > +static DEFINE_PER_CPU(atomic_t, perf_freq_events);
> >
> > static atomic_t nr_mmap_events __read_mostly;
> > static atomic_t nr_comm_events __read_mostly;
> > @@ -3139,6 +3140,9 @@ static void unaccount_event_cpu(struct perf_event *event, int cpu)
> > }
> > if (is_cgroup_event(event))
> > atomic_dec(&per_cpu(perf_cgroup_events, cpu));
> > +
> > + if (event->attr.freq)
> > + atomic_dec(&per_cpu(perf_freq_events, cpu));
> > }
> >
> > static void unaccount_event(struct perf_event *event)
> > @@ -6473,6 +6477,9 @@ static void account_event_cpu(struct perf_event *event, int cpu)
> > }
> > if (is_cgroup_event(event))
> > atomic_inc(&per_cpu(perf_cgroup_events, cpu));
> > +
> > + if (event->attr.freq)
> > + atomic_inc(&per_cpu(perf_freq_events, cpu));
>
> cpu could be -1 in here.. getting:
>
> [ 178.901881] BUG: unable to handle kernel paging request at 0000000f001cf45a
> [ 178.901989] IP: [<ffffffff8110c32b>] account_event_cpu+0xbb/0xd0
> [ 178.901989] PGD 751a5067 PUD 0
> [ 178.901989] Oops: 0002 [#2] PREEMPT SMP
> [ 178.901989] Modules linked in:
> [ 178.901989] CPU: 0 PID: 1309 Comm: perf Tainted: G D W 3.11.0-rc3+ #281
> [ 178.901989] Hardware name: Intel Corporation Montevina platform/To be filled by O.E.M., BIOS AMVACRB1.86C.0066.B00.0805070703 05/07/2008
> [ 178.901989] task: ffff88007519e5c0 ti: ffff880074c08000 task.ti: ffff880074c08000
> [ 178.901989] RIP: 0010:[<ffffffff8110c32b>] [<ffffffff8110c32b>] account_event_cpu+0xbb/0xd0
> [ 178.901989] RSP: 0018:ffff880074c09e38 EFLAGS: 00010216
> [ 178.901989] RAX: 0000000f001cf45a RBX: ffff880074c1a800 RCX: 0000000000100000
> [ 178.901989] RDX: 000000000000002c RSI: 000000000000002c RDI: ffff880000120400
> [ 178.901989] RBP: ffff880074c09e48 R08: 0000000000000000 R09: 00000000ffffffff
> [ 178.901989] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffffffffff
> [ 178.901989] R13: ffff880074c0e640 R14: 00000000ffffffff R15: 0000000000000000
> [ 178.901989] FS: 00007fc670738980(0000) GS:ffff88007a600000(0000) knlGS:0000000000000000
> [ 178.901989] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 178.901989] CR2: 0000000f001cf45a CR3: 00000000751a8000 CR4: 00000000000407f0
> [ 178.901989] Stack:
> [ 178.901989] ffff880074c1a800 ffff880074c09ee8 ffff880074c09e68 ffffffff8110c3b6
> [ 178.901989] 00000000ffffffff ffff880074c1a800 ffff880074c09f78 ffffffff81113ae8
> [ 178.901989] 0000000000000000 0000000000000001 ffff880074c09eb8 ffffffff00000060
> [ 178.901989] Call Trace:
> [ 178.901989] [<ffffffff8110c3b6>] account_event.part.66+0x76/0xa0
> [ 178.901989] [<ffffffff81113ae8>] SyS_perf_event_open+0x678/0xdf0
> [ 178.901989] [<ffffffff8100ed42>] ? syscall_trace_leave+0xb2/0x240
> [ 178.901989] [<ffffffff81641512>] tracesys+0xd0/0xd5
> [ 178.901989] Code: d4 48 c7 c0 28 f4 1c 00 48 03 04 d5 80 ad ce 81 f0 ff 00 f6 83 c9 00 00 00 04 74 a7 48 c7 c0 4c f4 1c 00 4a 03 04 e5 80 ad ce 81 <f0> ff 00 5b 41 5c 5d c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
> [ 178.901989] RIP [<ffffffff8110c32b>] account_event_cpu+0xbb/0xd0
> [ 178.901989] RSP <ffff880074c09e38>
> [ 178.901989] CR2: 0000000f001cf45a
> [ 179.122749] ---[ end trace 6f2f4f69b01368fb ]---

the testcase was:

[root@dhcp-26-214 tracing]# cat
...
[root@dhcp-26-214 perf]# ./perf record -p `pgrep cat`
Killed

jirka

2013-08-01 13:02:17

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH 2/8] perf: Sanitize get_callchain_buffer()

On Tue, Jul 23, 2013 at 02:31:00AM +0200, Frederic Weisbecker wrote:
> In case of allocation failure, get_callchain_buffer() keeps the
> refcount incremented for the current event.
>
> As a result, when get_callchain_buffers() returns an error,
> we must cleanup what it did by cancelling its last refcount
> with a call to put_callchain_buffers().
>
> This is a hack in order to be able to call free_event()
> after that failure.
>
> The original purpose of that was to simplify the failure
> path. But this error handling is actually counter-intuitive,
> ugly and not very easy to follow, because one expects the
> resources used to perform a service to be cleaned up by the
> callee in case of failure, not by the caller.
>
> So lets clean this up by cancelling the refcount from
> get_callchain_buffer() in case of failure. And correctly free
> the event accordingly in perf_event_alloc().
>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Jiri Olsa <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Namhyung Kim <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Cc: Stephane Eranian <[email protected]>
> ---
> kernel/events/callchain.c | 2 ++
> kernel/events/core.c | 41 +++++++++++++++++++++--------------------
> 2 files changed, 23 insertions(+), 20 deletions(-)
>
> diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> index c772061..76a8bc5 100644
> --- a/kernel/events/callchain.c
> +++ b/kernel/events/callchain.c
> @@ -117,6 +117,8 @@ int get_callchain_buffers(void)
> err = alloc_callchain_buffers();
> exit:
> mutex_unlock(&callchain_mutex);
> + if (err)
> + atomic_dec(&nr_callchain_events);

shouldn't we touch this under above lock?

also that above hunk decrements nr_callchain_events
also for following case:

count = atomic_inc_return(&nr_callchain_events);
if (WARN_ON_ONCE(count < 1)) {
	err = -EINVAL;
	goto exit;
}

seems like it screws the count

jirka

2013-08-01 13:14:00

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH 3/8] perf: Gather event accounting code

On Tue, Jul 23, 2013 at 02:31:01AM +0200, Frederic Weisbecker wrote:
> Gather all the event accounting code to a single place,
> once all the prerequisites are completed. This simplifies
> the refcounting.
>
> Original-patch-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Jiri Olsa <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Namhyung Kim <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Cc: Stephane Eranian <[email protected]>
> ---
> kernel/events/core.c | 79 +++++++++++++++++++++++++++++--------------------
> 1 files changed, 47 insertions(+), 32 deletions(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index c5f435f..3bb73af 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3128,6 +3128,21 @@ static void free_event_rcu(struct rcu_head *head)
> static void ring_buffer_put(struct ring_buffer *rb);
> static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb);
>
> +static void __free_event(struct perf_event *event)
> +{
> + if (!event->parent) {
> + if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
> + put_callchain_buffers();
> + }
> +
> + if (event->destroy)
> + event->destroy(event);
> +
> + if (event->ctx)
> + put_ctx(event->ctx);
> +
> + call_rcu(&event->rcu_head, free_event_rcu);
> +}
> static void free_event(struct perf_event *event)
> {

nitpick, missing nl between functions

> irq_work_sync(&event->pending);
> @@ -3141,8 +3156,6 @@ static void free_event(struct perf_event *event)
> atomic_dec(&nr_comm_events);
> if (event->attr.task)
> atomic_dec(&nr_task_events);
> - if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
> - put_callchain_buffers();
> if (is_cgroup_event(event)) {
> atomic_dec(&per_cpu(perf_cgroup_events, event->cpu));
> static_key_slow_dec_deferred(&perf_sched_events);
> @@ -3180,13 +3193,8 @@ static void free_event(struct perf_event *event)
> if (is_cgroup_event(event))
> perf_detach_cgroup(event);
>
> - if (event->destroy)
> - event->destroy(event);
> -
> - if (event->ctx)
> - put_ctx(event->ctx);
>
> - call_rcu(&event->rcu_head, free_event_rcu);
> + __free_event(event);

nitpick, extra nl above

2013-08-01 13:28:40

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 2/8] perf: Sanitize get_callchain_buffer()

On Thu, Aug 01, 2013 at 03:01:46PM +0200, Jiri Olsa wrote:
> On Tue, Jul 23, 2013 at 02:31:00AM +0200, Frederic Weisbecker wrote:
> > In case of allocation failure, get_callchain_buffer() keeps the
> > refcount incremented for the current event.
> >
> > As a result, when get_callchain_buffers() returns an error,
> > we must cleanup what it did by cancelling its last refcount
> > with a call to put_callchain_buffers().
> >
> > This is a hack in order to be able to call free_event()
> > after that failure.
> >
> > The original purpose of that was to simplify the failure
> > path. But this error handling is actually counter-intuitive,
> > ugly and not very easy to follow, because one expects the
> > resources used to perform a service to be cleaned up by the
> > callee in case of failure, not by the caller.
> >
> > So lets clean this up by cancelling the refcount from
> > get_callchain_buffer() in case of failure. And correctly free
> > the event accordingly in perf_event_alloc().
> >
> > Signed-off-by: Frederic Weisbecker <[email protected]>
> > Cc: Jiri Olsa <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Namhyung Kim <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Arnaldo Carvalho de Melo <[email protected]>
> > Cc: Stephane Eranian <[email protected]>
> > ---
> > kernel/events/callchain.c | 2 ++
> > kernel/events/core.c | 41 +++++++++++++++++++++--------------------
> > 2 files changed, 23 insertions(+), 20 deletions(-)
> >
> > diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> > index c772061..76a8bc5 100644
> > --- a/kernel/events/callchain.c
> > +++ b/kernel/events/callchain.c
> > @@ -117,6 +117,8 @@ int get_callchain_buffers(void)
> > err = alloc_callchain_buffers();
> > exit:
> > mutex_unlock(&callchain_mutex);
> > + if (err)
> > + atomic_dec(&nr_callchain_events);
>
> shouldn't we touch this under above lock?

Right, we should move that under the lock, or another user of the callchains may see that we failed
the allocation and simply give up instead of retrying.
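
For illustration, the tail of get_callchain_buffers() with the decrement kept
under the mutex would look roughly like this (an assumption about the respin,
not the hunk quoted above):

	err = alloc_callchain_buffers();
exit:
	/* drop our reference while still holding callchain_mutex, so
	 * concurrent callers never observe a transiently failed count */
	if (err)
		atomic_dec(&nr_callchain_events);

	mutex_unlock(&callchain_mutex);

	return err;
}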

>
> also that above hunk decrements nr_callchain_events
> also for following case:
>
> count = atomic_inc_return(&nr_callchain_events);
> if (WARN_ON_ONCE(count < 1)) {
> err = -EINVAL;
> goto exit;
> }
>
> seems like it screws the count

I'm not sure what you mean here. You mean that it could be negative when the dec is done
outside the lock?

>
> jirka

2013-08-01 13:30:09

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 3/8] perf: Gather event accounting code

On Thu, Aug 01, 2013 at 03:13:30PM +0200, Jiri Olsa wrote:
> On Tue, Jul 23, 2013 at 02:31:01AM +0200, Frederic Weisbecker wrote:
> > Gather all the event accounting code to a single place,
> > once all the prerequisites are completed. This simplifies
> > the refcounting.
> >
> > Original-patch-by: Peter Zijlstra <[email protected]>
> > Signed-off-by: Frederic Weisbecker <[email protected]>
> > Cc: Jiri Olsa <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Namhyung Kim <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Arnaldo Carvalho de Melo <[email protected]>
> > Cc: Stephane Eranian <[email protected]>
> > ---
> > kernel/events/core.c | 79 +++++++++++++++++++++++++++++--------------------
> > 1 files changed, 47 insertions(+), 32 deletions(-)
> >
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index c5f435f..3bb73af 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -3128,6 +3128,21 @@ static void free_event_rcu(struct rcu_head *head)
> > static void ring_buffer_put(struct ring_buffer *rb);
> > static void ring_buffer_detach(struct perf_event *event, struct ring_buffer *rb);
> >
> > +static void __free_event(struct perf_event *event)
> > +{
> > + if (!event->parent) {
> > + if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
> > + put_callchain_buffers();
> > + }
> > +
> > + if (event->destroy)
> > + event->destroy(event);
> > +
> > + if (event->ctx)
> > + put_ctx(event->ctx);
> > +
> > + call_rcu(&event->rcu_head, free_event_rcu);
> > +}
> > static void free_event(struct perf_event *event)
> > {
>
> nitpick, missing nl between functions
>
> > irq_work_sync(&event->pending);
> > @@ -3141,8 +3156,6 @@ static void free_event(struct perf_event *event)
> > atomic_dec(&nr_comm_events);
> > if (event->attr.task)
> > atomic_dec(&nr_task_events);
> > - if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
> > - put_callchain_buffers();
> > if (is_cgroup_event(event)) {
> > atomic_dec(&per_cpu(perf_cgroup_events, event->cpu));
> > static_key_slow_dec_deferred(&perf_sched_events);
> > @@ -3180,13 +3193,8 @@ static void free_event(struct perf_event *event)
> > if (is_cgroup_event(event))
> > perf_detach_cgroup(event);
> >
> > - if (event->destroy)
> > - event->destroy(event);
> > -
> > - if (event->ctx)
> > - put_ctx(event->ctx);
> >
> > - call_rcu(&event->rcu_head, free_event_rcu);
> > + __free_event(event);
>
> nitpick, extra nl above


Ok, I'll fix, thanks!

2013-08-01 13:32:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 6/8] perf: Account freq events per cpu

On Thu, Aug 01, 2013 at 02:46:58PM +0200, Jiri Olsa wrote:
> On Tue, Jul 23, 2013 at 02:31:04AM +0200, Frederic Weisbecker wrote:
> > This is going to be used by the full dynticks subsystem
> > as a finer-grained information to know when to keep and
> > when to stop the tick.
> >
> > Original-patch-by: Peter Zijlstra <[email protected]>
> > Signed-off-by: Frederic Weisbecker <[email protected]>
> > Cc: Jiri Olsa <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Namhyung Kim <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Arnaldo Carvalho de Melo <[email protected]>
> > Cc: Stephane Eranian <[email protected]>
> > ---
> > kernel/events/core.c | 7 +++++++
> > 1 files changed, 7 insertions(+), 0 deletions(-)
> >
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index b40c3db..f9bd39b 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -141,6 +141,7 @@ enum event_type_t {
> > struct static_key_deferred perf_sched_events __read_mostly;
> > static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
> > static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
> > +static DEFINE_PER_CPU(atomic_t, perf_freq_events);
> >
> > static atomic_t nr_mmap_events __read_mostly;
> > static atomic_t nr_comm_events __read_mostly;
> > @@ -3139,6 +3140,9 @@ static void unaccount_event_cpu(struct perf_event *event, int cpu)
> > }
> > if (is_cgroup_event(event))
> > atomic_dec(&per_cpu(perf_cgroup_events, cpu));
> > +
> > + if (event->attr.freq)
> > + atomic_dec(&per_cpu(perf_freq_events, cpu));
> > }
> >
> > static void unaccount_event(struct perf_event *event)
> > @@ -6473,6 +6477,9 @@ static void account_event_cpu(struct perf_event *event, int cpu)
> > }
> > if (is_cgroup_event(event))
> > atomic_inc(&per_cpu(perf_cgroup_events, cpu));
> > +
> > + if (event->attr.freq)
> > + atomic_inc(&per_cpu(perf_freq_events, cpu));
>
> cpu could be -1 in here.. getting:

Ho humm, right you are.

So we have:

static void account_event_cpu(struct perf_event *event, int cpu)
{
	if (event->parent)
		return;

	if (has_branch_stack(event)) {
		if (!(event->attach_state & PERF_ATTACH_TASK))
			atomic_inc(&per_cpu(perf_branch_stack_events, cpu));
	}
	if (is_cgroup_event(event))
		atomic_inc(&per_cpu(perf_cgroup_events, cpu));

	if (event->attr.freq)
		atomic_inc(&per_cpu(perf_freq_events, cpu));
}

Where the freq thing is new and shiney, but we already had the other
two. Of those, cgroup events must be per cpu so that should be good,
the branch_stack thing tests ATTACH_TASK, which should also be good, but
leaves me wonder wth they do for those that are attached to tasks.

But yes, the frequency thing is borken.
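
A one-line guard on that last hunk would at least stop the oops; purely an
illustrative sketch, since the thread below converges on a global counter
rather than per-cpu state:

	/* task-bound events come in with cpu == -1; indexing per-cpu
	 * state with that produces the wild access in Jiri's oops */
	if (event->attr.freq && cpu != -1)
		atomic_inc(&per_cpu(perf_freq_events, cpu));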

2013-08-01 13:32:49

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH 2/8] perf: Sanitize get_callchain_buffer()

On Thu, Aug 01, 2013 at 03:28:34PM +0200, Frederic Weisbecker wrote:

SNIP

> > also for following case:
> >
> > count = atomic_inc_return(&nr_callchain_events);
> > if (WARN_ON_ONCE(count < 1)) {
> > err = -EINVAL;
> > goto exit;
> > }
> >
> > seems like it screws the count
>
> I'm not sure what you mean here. You mean that it could be negative when the dec is done
> outside the lock?

yes

2013-08-01 13:33:23

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH 2/8] perf: Sanitize get_callchain_buffer()

On Tue, Jul 23, 2013 at 02:31:00AM +0200, Frederic Weisbecker wrote:
SNIP

> if (event->attach_state & PERF_ATTACH_TASK)
> static_key_slow_inc(&perf_sched_events.key);
> if (event->attr.mmap || event->attr.mmap_data)
> @@ -6572,16 +6570,19 @@ done:
> atomic_inc(&per_cpu(perf_branch_stack_events,
> event->cpu));
> }
> - if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
> - err = get_callchain_buffers();
> - if (err) {
> - free_event(event);
> - return ERR_PTR(err);
> - }
> - }
> }
>
> return event;
> +
> +err_pmu:
> + if (event->destroy)
> + event->destroy(event);
> +err_ns:
> + if (event->ns)
> + put_pid_ns(event->ns);
> + kfree(event);
> +
> + return ERR_PTR(err);

could we call __free_filter(event) here?

jirka

2013-08-01 13:35:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 6/8] perf: Account freq events per cpu

On Thu, Aug 01, 2013 at 03:31:55PM +0200, Peter Zijlstra wrote:
> the branch_stack thing tests ATTACH_TASK, which should also be good, but
> leaves me wonder wth they do for those that are attached to tasks.

Ah found it, in intel_pmu_lbr_enable() we test cpuc->lbr_context against
event->ctx and wipe the LBR if they differ, that should indeed avoid the
leak but isn't sufficient to guarantee sane data as the context swap for
inherited counters can foil this.
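
Roughly the check being referred to, paraphrased from memory of that era's
intel_pmu_lbr_enable() rather than quoted:

	/* reset the LBR stack when the task context it was last filled
	 * for is not the context of the event being scheduled in */
	if (event->ctx->task && cpuc->lbr_context != event->ctx) {
		intel_pmu_lbr_reset();
		cpuc->lbr_context = event->ctx;
	}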

2013-08-01 13:39:52

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH 6/8] perf: Account freq events per cpu

On Thu, Aug 01, 2013 at 03:31:55PM +0200, Peter Zijlstra wrote:
> On Thu, Aug 01, 2013 at 02:46:58PM +0200, Jiri Olsa wrote:
> > On Tue, Jul 23, 2013 at 02:31:04AM +0200, Frederic Weisbecker wrote:
> > > This is going to be used by the full dynticks subsystem
> > > as a finer-grained information to know when to keep and
> > > when to stop the tick.
> > >
> > > Original-patch-by: Peter Zijlstra <[email protected]>
> > > Signed-off-by: Frederic Weisbecker <[email protected]>
> > > Cc: Jiri Olsa <[email protected]>
> > > Cc: Peter Zijlstra <[email protected]>
> > > Cc: Namhyung Kim <[email protected]>
> > > Cc: Ingo Molnar <[email protected]>
> > > Cc: Arnaldo Carvalho de Melo <[email protected]>
> > > Cc: Stephane Eranian <[email protected]>
> > > ---
> > > kernel/events/core.c | 7 +++++++
> > > 1 files changed, 7 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > index b40c3db..f9bd39b 100644
> > > --- a/kernel/events/core.c
> > > +++ b/kernel/events/core.c
> > > @@ -141,6 +141,7 @@ enum event_type_t {
> > > struct static_key_deferred perf_sched_events __read_mostly;
> > > static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
> > > static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
> > > +static DEFINE_PER_CPU(atomic_t, perf_freq_events);
> > >
> > > static atomic_t nr_mmap_events __read_mostly;
> > > static atomic_t nr_comm_events __read_mostly;
> > > @@ -3139,6 +3140,9 @@ static void unaccount_event_cpu(struct perf_event *event, int cpu)
> > > }
> > > if (is_cgroup_event(event))
> > > atomic_dec(&per_cpu(perf_cgroup_events, cpu));
> > > +
> > > + if (event->attr.freq)
> > > + atomic_dec(&per_cpu(perf_freq_events, cpu));
> > > }
> > >
> > > static void unaccount_event(struct perf_event *event)
> > > @@ -6473,6 +6477,9 @@ static void account_event_cpu(struct perf_event *event, int cpu)
> > > }
> > > if (is_cgroup_event(event))
> > > atomic_inc(&per_cpu(perf_cgroup_events, cpu));
> > > +
> > > + if (event->attr.freq)
> > > + atomic_inc(&per_cpu(perf_freq_events, cpu));
> >
> > cpu could be -1 in here.. getting:
>
> Ho humm, right you are.
>
> So we have:
>
> static void account_event_cpu(struct perf_event *event, int cpu)
> {
> if (event->parent)
> return;
>
> if (has_branch_stack(event)) {
> if (!(event->attach_state & PERF_ATTACH_TASK))
> atomic_inc(&per_cpu(perf_branch_stack_events, cpu));
> }
> if (is_cgroup_event(event))
> atomic_inc(&per_cpu(perf_cgroup_events, cpu));
>
> if (event->attr.freq)
> atomic_inc(&per_cpu(perf_freq_events, cpu));
> }
>
> Where the freq thing is new and shiney, but we already had the other
> two. Of those, cgroup events must be per cpu so that should be good,
> the branch_stack thing tests ATTACH_TASK, which should also be good, but
> leaves me wonder wth they do for those that are attached to tasks.

cgroup is cpu only:

SYSCALL(..
	if ((flags & PERF_FLAG_PID_CGROUP) && (pid == -1 || cpu == -1))
		return -EINVAL;

jirka

2013-08-01 13:42:34

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 2/8] perf: Sanitize get_callchain_buffer()

On Thu, Aug 01, 2013 at 03:29:34PM +0200, Jiri Olsa wrote:
> On Tue, Jul 23, 2013 at 02:31:00AM +0200, Frederic Weisbecker wrote:
> SNIP
>
> > if (event->attach_state & PERF_ATTACH_TASK)
> > static_key_slow_inc(&perf_sched_events.key);
> > if (event->attr.mmap || event->attr.mmap_data)
> > @@ -6572,16 +6570,19 @@ done:
> > atomic_inc(&per_cpu(perf_branch_stack_events,
> > event->cpu));
> > }
> > - if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
> > - err = get_callchain_buffers();
> > - if (err) {
> > - free_event(event);
> > - return ERR_PTR(err);
> > - }
> > - }
> > }
> >
> > return event;
> > +
> > +err_pmu:
> > + if (event->destroy)
> > + event->destroy(event);
> > +err_ns:
> > + if (event->ns)
> > + put_pid_ns(event->ns);
> > + kfree(event);
> > +
> > + return ERR_PTR(err);
>
> could we call __free_filter(event) here?

Hmm, the filters are installed from ioctl time so there shouldn't be any yet. But there should be
an exception with inherited events. I fail to find where the filter is inherited though. Do
we actually inherit those?

thanks.

2013-08-01 13:49:40

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 2/8] perf: Sanitize get_callchain_buffer()

On Thu, Aug 01, 2013 at 03:32:17PM +0200, Jiri Olsa wrote:
> On Thu, Aug 01, 2013 at 03:28:34PM +0200, Frederic Weisbecker wrote:
>
> SNIP
>
> > > also for following case:
> > >
> > > count = atomic_inc_return(&nr_callchain_events);
> > > if (WARN_ON_ONCE(count < 1)) {
> > > err = -EINVAL;
> > > goto exit;
> > > }
> > >
> > > seems like it screws the count
> >
> > I'm not sure what you mean here. You mean that it could be negative when the dec is done
> > outside the lock?
>
> yes
>

I don't understand why.

2013-08-01 13:51:35

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH 2/8] perf: Sanitize get_callchain_buffer()

On Thu, Aug 01, 2013 at 03:42:28PM +0200, Frederic Weisbecker wrote:
> On Thu, Aug 01, 2013 at 03:29:34PM +0200, Jiri Olsa wrote:
> > On Tue, Jul 23, 2013 at 02:31:00AM +0200, Frederic Weisbecker wrote:
> > SNIP
> >
> > > if (event->attach_state & PERF_ATTACH_TASK)
> > > static_key_slow_inc(&perf_sched_events.key);
> > > if (event->attr.mmap || event->attr.mmap_data)
> > > @@ -6572,16 +6570,19 @@ done:
> > > atomic_inc(&per_cpu(perf_branch_stack_events,
> > > event->cpu));
> > > }
> > > - if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
> > > - err = get_callchain_buffers();
> > > - if (err) {
> > > - free_event(event);
> > > - return ERR_PTR(err);
> > > - }
> > > - }
> > > }
> > >
> > > return event;
> > > +
> > > +err_pmu:
> > > + if (event->destroy)
> > > + event->destroy(event);
> > > +err_ns:
> > > + if (event->ns)
> > > + put_pid_ns(event->ns);
> > > + kfree(event);
> > > +
> > > + return ERR_PTR(err);
> >
> > could we call __free_filter(event) here?
>
> Hmm, the filters are installed from ioctl time so there shouldn't be any yet. But there should be
> an exception with inherited events. I fail to find where the filter is inherited though. Do
> we actually inherit those?

ouch.. last I checked was freeing filter before writing this... :)

what I meant was the __free_event(event)

sorry

jirka

2013-08-01 13:54:30

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH 2/8] perf: Sanitize get_callchain_buffer()

On Thu, Aug 01, 2013 at 03:49:36PM +0200, Frederic Weisbecker wrote:
> On Thu, Aug 01, 2013 at 03:32:17PM +0200, Jiri Olsa wrote:
> > On Thu, Aug 01, 2013 at 03:28:34PM +0200, Frederic Weisbecker wrote:
> >
> > SNIP
> >
> > > > also for following case:
> > > >
> > > > count = atomic_inc_return(&nr_callchain_events);
> > > > if (WARN_ON_ONCE(count < 1)) {
> > > > err = -EINVAL;
> > > > goto exit;
> > > > }
> > > >
> > > > seems like it screws the count
> > >
> > > I'm not sure what you mean here. You mean that it could be negative when the dec is done
> > > outside the lock?
> >
> > yes
> >
>
> I don't understand why.

ah ok, with the count being 0 the nr_callchain_events still
increments right? then it's ok..

jirka

2013-08-01 13:55:33

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 6/8] perf: Account freq events per cpu

On Thu, Aug 01, 2013 at 03:31:55PM +0200, Peter Zijlstra wrote:
> On Thu, Aug 01, 2013 at 02:46:58PM +0200, Jiri Olsa wrote:
> > On Tue, Jul 23, 2013 at 02:31:04AM +0200, Frederic Weisbecker wrote:
> > > This is going to be used by the full dynticks subsystem
> > > as a finer-grained information to know when to keep and
> > > when to stop the tick.
> > >
> > > Original-patch-by: Peter Zijlstra <[email protected]>
> > > Signed-off-by: Frederic Weisbecker <[email protected]>
> > > Cc: Jiri Olsa <[email protected]>
> > > Cc: Peter Zijlstra <[email protected]>
> > > Cc: Namhyung Kim <[email protected]>
> > > Cc: Ingo Molnar <[email protected]>
> > > Cc: Arnaldo Carvalho de Melo <[email protected]>
> > > Cc: Stephane Eranian <[email protected]>
> > > ---
> > > kernel/events/core.c | 7 +++++++
> > > 1 files changed, 7 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > index b40c3db..f9bd39b 100644
> > > --- a/kernel/events/core.c
> > > +++ b/kernel/events/core.c
> > > @@ -141,6 +141,7 @@ enum event_type_t {
> > > struct static_key_deferred perf_sched_events __read_mostly;
> > > static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
> > > static DEFINE_PER_CPU(atomic_t, perf_branch_stack_events);
> > > +static DEFINE_PER_CPU(atomic_t, perf_freq_events);
> > >
> > > static atomic_t nr_mmap_events __read_mostly;
> > > static atomic_t nr_comm_events __read_mostly;
> > > @@ -3139,6 +3140,9 @@ static void unaccount_event_cpu(struct perf_event *event, int cpu)
> > > }
> > > if (is_cgroup_event(event))
> > > atomic_dec(&per_cpu(perf_cgroup_events, cpu));
> > > +
> > > + if (event->attr.freq)
> > > + atomic_dec(&per_cpu(perf_freq_events, cpu));
> > > }
> > >
> > > static void unaccount_event(struct perf_event *event)
> > > @@ -6473,6 +6477,9 @@ static void account_event_cpu(struct perf_event *event, int cpu)
> > > }
> > > if (is_cgroup_event(event))
> > > atomic_inc(&per_cpu(perf_cgroup_events, cpu));
> > > +
> > > + if (event->attr.freq)
> > > + atomic_inc(&per_cpu(perf_freq_events, cpu));
> >
> > cpu could be -1 in here.. getting:
>
> Ho humm, right you are.
>
> So we have:
>
> static void account_event_cpu(struct perf_event *event, int cpu)
> {
> if (event->parent)
> return;
>
> if (has_branch_stack(event)) {
> if (!(event->attach_state & PERF_ATTACH_TASK))
> atomic_inc(&per_cpu(perf_branch_stack_events, cpu));
> }
> if (is_cgroup_event(event))
> atomic_inc(&per_cpu(perf_cgroup_events, cpu));
>
> if (event->attr.freq)
> atomic_inc(&per_cpu(perf_freq_events, cpu));
> }
>
> Where the freq thing is new and shiney, but we already had the other
> two. Of those, cgroup events must be per cpu so that should be good,
> the branch_stack thing tests ATTACH_TASK, which should also be good, but
> leaves me wonder wth they do for those that are attached to tasks.
>
> But yes, the frequency thing is borken.

Aie, so the freq thing, I can either account to all CPUs (inc to all and send an IPI to all), or
when the event scheds in/out. Probably we should do the former to avoid sending an IPI at all context switches.
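
For reference, the first option would look something like this; a hypothetical
sketch with an invented helper name, not something that was posted (and the
thread settles on a single global counter below):

static void account_freq_event_everywhere(void)
{
	int cpu;

	/* charge every CPU so a task-bound freq event keeps any tick alive */
	for_each_possible_cpu(cpu)
		atomic_inc(&per_cpu(perf_freq_events, cpu));

	/* an IPI/kick to the nohz_full CPUs would follow here so that
	 * already tickless CPUs re-evaluate and restart their tick */
}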

2013-08-01 13:56:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 6/8] perf: Account freq events per cpu

On Thu, Aug 01, 2013 at 03:39:21PM +0200, Jiri Olsa wrote:
> On Thu, Aug 01, 2013 at 03:31:55PM +0200, Peter Zijlstra wrote:
> > two. Of those, cgroup events must be per cpu so that should be good,

> cgroup is cpu only:

jah! :-)

2013-08-01 13:58:05

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 2/8] perf: Sanitize get_callchain_buffer()

On Thu, Aug 01, 2013 at 03:54:01PM +0200, Jiri Olsa wrote:
> On Thu, Aug 01, 2013 at 03:49:36PM +0200, Frederic Weisbecker wrote:
> > On Thu, Aug 01, 2013 at 03:32:17PM +0200, Jiri Olsa wrote:
> > > On Thu, Aug 01, 2013 at 03:28:34PM +0200, Frederic Weisbecker wrote:
> > >
> > > SNIP
> > >
> > > > > also for following case:
> > > > >
> > > > > count = atomic_inc_return(&nr_callchain_events);
> > > > > if (WARN_ON_ONCE(count < 1)) {
> > > > > err = -EINVAL;
> > > > > goto exit;
> > > > > }
> > > > >
> > > > > seems like it screws the count
> > > >
> > > > I'm not sure what you mean here. You mean that it could be negative when the dec is done
> > > > outside the lock?
> > >
> > > yes
> > >
> >
> > I don't understand why.
>
> ah ok, with the count being 0 the nr_callchain_events still
> increments right? then it's ok..

Yep. But I still need to move the dec into the lock though.

Thanks for catching this!

2013-08-01 14:03:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 6/8] perf: Account freq events per cpu

On Thu, Aug 01, 2013 at 03:55:27PM +0200, Frederic Weisbecker wrote:
> On Thu, Aug 01, 2013 at 03:31:55PM +0200, Peter Zijlstra wrote:
> >
> > Where the freq thing is new and shiney, but we already had the other
> > two. Of those, cgroup events must be per cpu so that should be good,
> > the branch_stack thing tests ATTACH_TASK, which should also be good, but
> > leaves me wonder wth they do for those that are attached to tasks.
> >
> > But yes, the frequency thing is borken.
>
> Aie, so the freq thing, I can either account to all CPUs (inc to all
> and send an IPI to all), or when the event scheds in/out. Probably we
> should do the former to avoid sending an IPI at all context switches.

Yeah, just go with a single global state for now..

2013-08-01 14:06:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 6/8] perf: Account freq events per cpu

On Thu, Aug 01, 2013 at 04:03:52PM +0200, Peter Zijlstra wrote:
> On Thu, Aug 01, 2013 at 03:55:27PM +0200, Frederic Weisbecker wrote:
> > On Thu, Aug 01, 2013 at 03:31:55PM +0200, Peter Zijlstra wrote:
> > >
> > > Where the freq thing is new and shiney, but we already had the other
> > > two. Of those, cgroup events must be per cpu so that should be good,
> > > the branch_stack thing tests ATTACH_TASK, which should also be good, but
> > > leaves me wonder wth they do for those that are attached to tasks.
> > >
> > > But yes, the frequency thing is borken.
> >
> > Aie, so the freq thing, I can either account to all CPUs (inc to all
> > and send an IPI to all), or when the event scheds in/out. Probably we
> > should do the former to avoid sending an IPI at all context switches.
>
> Yeah, just go with a single global state for now..

The perf default is to create inherited counters, which are per cpu
anyway. So we'll not lose much.

2013-08-01 14:19:24

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 6/8] perf: Account freq events per cpu

On Thu, Aug 01, 2013 at 04:03:52PM +0200, Peter Zijlstra wrote:
> On Thu, Aug 01, 2013 at 03:55:27PM +0200, Frederic Weisbecker wrote:
> > On Thu, Aug 01, 2013 at 03:31:55PM +0200, Peter Zijlstra wrote:
> > >
> > > Where the freq thing is new and shiney, but we already had the other
> > > two. Of those, cgroup events must be per cpu so that should be good,
> > > the branch_stack thing tests ATTACH_TASK, which should also be good, but
> > > leaves me wonder wth they do for those that are attached to tasks.
> > >
> > > But yes, the frequency thing is borken.
> >
> > Aie, so the freq thing, I can either account to all CPUs (inc to all
> > and send an IPI to all), or when the event scheds in/out. Probably we
> > should do the former to avoid sending an IPI at all context switches.
>
> Yeah, just go with a single global state for now..

A single global counter? Ok sounds good.

2013-08-01 14:21:15

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 6/8] perf: Account freq events per cpu

On Thu, Aug 01, 2013 at 04:06:15PM +0200, Peter Zijlstra wrote:
> On Thu, Aug 01, 2013 at 04:03:52PM +0200, Peter Zijlstra wrote:
> > On Thu, Aug 01, 2013 at 03:55:27PM +0200, Frederic Weisbecker wrote:
> > > On Thu, Aug 01, 2013 at 03:31:55PM +0200, Peter Zijlstra wrote:
> > > >
> > > > Where the freq thing is new and shiney, but we already had the other
> > > > two. Of those, cgroup events must be per cpu so that should be good,
> > > > the branch_stack thing tests ATTACH_TASK, which should also be good, but
> > > > leaves me wonder wth they do for those that are attached to tasks.
> > > >
> > > > But yes, the frequency thing is borken.
> > >
> > > Aie, so the freq thing, I can either account to all CPUs (inc to all
> > > and send an IPI to all), or when the event scheds in/out. Probably we
> > > should do the former to avoid sending an IPI at all context switches.
> >
> > Yeah, just go with a single global state for now..
>
> The perf default is to create inherited counters, which are per cpu
> anyway. So we'll not lose much.

So you mean that I keep the per cpu state when event->cpu != -1 and also have
a global counter for the others. Right?

2013-08-01 14:30:13

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 2/8] perf: Sanitize get_callchain_buffer()

On Thu, Aug 01, 2013 at 03:51:02PM +0200, Jiri Olsa wrote:
> On Thu, Aug 01, 2013 at 03:42:28PM +0200, Frederic Weisbecker wrote:
> > On Thu, Aug 01, 2013 at 03:29:34PM +0200, Jiri Olsa wrote:
> > > On Tue, Jul 23, 2013 at 02:31:00AM +0200, Frederic Weisbecker wrote:
> > > SNIP
> > >
> > > > if (event->attach_state & PERF_ATTACH_TASK)
> > > > static_key_slow_inc(&perf_sched_events.key);
> > > > if (event->attr.mmap || event->attr.mmap_data)
> > > > @@ -6572,16 +6570,19 @@ done:
> > > > atomic_inc(&per_cpu(perf_branch_stack_events,
> > > > event->cpu));
> > > > }
> > > > - if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
> > > > - err = get_callchain_buffers();
> > > > - if (err) {
> > > > - free_event(event);
> > > > - return ERR_PTR(err);
> > > > - }
> > > > - }
> > > > }
> > > >
> > > > return event;
> > > > +
> > > > +err_pmu:
> > > > + if (event->destroy)
> > > > + event->destroy(event);
> > > > +err_ns:
> > > > + if (event->ns)
> > > > + put_pid_ns(event->ns);
> > > > + kfree(event);
> > > > +
> > > > + return ERR_PTR(err);
> > >
> > > could we call __free_filter(event) here?
> >
> > Hmm, the filters are installed from ioctl time so there shouldn't be any yet. But there should be
> > an exception with inherited events. I fail to find where the filter is inherited though. Do
> > we actually inherit those?
>
> ouch.. last I checked was freeing filter before writing this... :)
>
> what I meant was the __free_event(event)

free_event() doesn't work either because we want several levels of rollback depending
on where the error triggered:

+err_pmu:
	if (event->destroy)
		event->destroy(event);
+err_ns:
	if (event->ns)
		put_pid_ns(event->ns);
	kfree(event);

	return ERR_PTR(err)

If we fail after pmu init we want to call destroy, free pid ns and the event.
If we fail before the pmu init, we want to only free pid ns and the event, ...

_free_event() does the whole thing in any case, which is not what we want.

But...

OTOH it might work due to the if (event->destroy) and if (event->ns) before freeing the
resource associated.

So maybe I can replace the labels with a single call to __free_event() after all, as it
checks what needs to be freed. What do you think?

2013-08-01 14:40:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 6/8] perf: Account freq events per cpu

On Thu, Aug 01, 2013 at 04:21:11PM +0200, Frederic Weisbecker wrote:
> > > Yeah, just go with a single global state for now..
> >
> > The perf default is to create inherited counters, which are per cpu
> > anyway. So we'll not lose much.
>
> So you mean that I keep the per cpu state when event->cpu != -1 and also have
> a global counter for the others. Right?

No, only a single global counter. The fact that perf, by default,
creates a counter per cpu, means that there's little effective
difference between a single global counter and per-cpu counters.
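
To make that concrete, a minimal sketch of the single global state; the
perf_event_can_stop_tick() hook already exists, but the bodies below are an
assumption and ignore the other tick users (rotation, throttling) the real
check also has to consider:

static atomic_t nr_freq_events __read_mostly;

static void account_freq_event(void)
{
	/* a kick of the nohz_full CPUs would go here, so CPUs that have
	 * already stopped their tick notice the new freq event */
	atomic_inc(&nr_freq_events);
}

bool perf_event_can_stop_tick(void)
{
	/* keep the tick as long as any freq event exists anywhere */
	return !atomic_read(&nr_freq_events);
}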

2013-08-02 16:26:00

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 6/8] perf: Account freq events per cpu

On Thu, Aug 01, 2013 at 04:40:05PM +0200, Peter Zijlstra wrote:
> On Thu, Aug 01, 2013 at 04:21:11PM +0200, Frederic Weisbecker wrote:
> > > > Yeah, just go with a single global state for now..
> > >
> > > The perf default is to create inherited counter, which are per cpu
> > > anyway. So we'll not loose much.
> >
> > So you mean that I keep the per cpu state when event->cpu != -1 and also have
> > a global counter for the others. Right?
>
> No, only a single global counter. The fact that perf, by default,
> creates a counter per cpu, means that there's little effective
> difference between a single global counter and per-cpu counters.

Good point! I just included that in the changelog.

Thanks.